Need

Policy development and administrative decision making in the public
sector often require the collection and analysis of quantitative data and
the evaluation of analytic results. Consequently, potential managers
and administrators require a solid foundation in statistical reasoning
and data analytic procedures if they are to become effective practitioners.
Most curricula in public management or administration have long recognized
this need and have included an elementary statistics course in their
required sequence. However, the ad hoc nature of most policy or
administrative analyses, the low quality of most data sources, and the need
of policy analysts to communicate results to non-technical audiences suggest
that the usual introduction to statistics may not be optimal. In addition,
since students of policy management rarely complete more than a two-semester
introductory sequence in quantitative methods, the course must be
comprehensive, covering advanced as well as introductory material.
Unfortunately, an additional problem is that most introductory texts in
statistics, and therefore most introductory courses, are oriented towards
the natural or biological sciences, covering topics and developing examples
of relevance chiefly to these disciplines.

Quantitative Methods for Public Management (QMPM) represents a break
with the traditional approach. The course contains instructional material
covering a broad range of statistical and data analytic procedures chosen
on the basis of their probable utility to public managers and administrators.
The material emphasizes graphics, robust procedures, model development,
and the evaluation and critique of analyses. Besides a specially selected
set of topics, the course contains data derived from "real world" policy
relevant situations. All examples, exercises, and exam problems derive
from actual empirical situations of relevance to public policy managers
and administrators. By providing relevant contexts for the development
and exercise of abstract methods, the course assures a deeper and more
lasting educational experience for the student and enhances the student's
likelihood of successfully mastering these methods. Quantitative
evaluations of the educational effectiveness of QMPM have shown that it
possesses definite advantages over traditional approaches (Leinhardt and
Wasserman, forthcoming; Leinhardt, Leinhardt and Wasserman, 1977).

General Overview

The package consists of three elements: (1) a set of detailed lecture
outlines and supplemental material for an instructor; (2) a set of reading
assignments, exercises, and exams for students; and (3) a computer system
for performing data analysis on numerical data files.

Instructors, assumed to be experienced at teaching statistics or
quantitative methods, use the lecture outlines as guides in the preparation
of each 90 minute lecture. The outlines are extensively detailed and
organized in a consistent manner. Learning goals and presentation
activities are clearly defined, and presentation aids such as overhead
projector transparency masters are keyed directly to the lecture outline.
Since many topics covered in QMPM do not appear in traditional statistics
textbooks, suggested readings are specified to provide instructors with a
guide to background material.

Students are expected to have a minimum mathematical preparation of
college algebra. Units containing more advanced mathematical material
(such as calculus) are preceded by prerequisite inventories intended to
detect weaknesses in student preparation and to serve as a remedial resource.
Supplementary material for reading by students provides coverage of items
whose mastery is prerequisite to mastery of unit material. Reading
assignments, exercises, and examinations are keyed to the lecture sequence
and are designed to provide students with textual descriptions of methods,
relevant examples of empirical applications, and opportunities to exercise
newly learned skills on problems whose substance is intellectually
interesting and of a contemporary nature. Worked solutions to problem sets
are provided so that feedback to the student can be rapid and, therefore,
educationally effective.

The reading assignments for students refer to both textbooks in methods
and academic journals. Several texts are used since, at the time the course
was designed, no single text existed which covered all the topics represented
in QMPM. Those texts which are heavily read should be purchased while others
can be consulted at the library. Journal articles serve the purpose of
exposing students to studies of the type they will likely have to read and
digest in performing future professional activities. By and large, the
selected articles are reprinted in certain edited volumes, and purchase
of these is suggested. Other material can be found in university and
college libraries.

A computer software system is available to provide students with the
opportunity to perform numerous data analyses. One of the most limiting
features of traditional approaches to the teaching of data analysis is
their reliance on student performance of the arithmetic necessary for the
completion of an exercise. While hand calculators have facilitated these
operations, many of the procedures covered in QMPM require elaborate
arithmetical operations which are arduous to perform even on advanced
hand calculators. Additionally, effective learning of data analytic
practices requires the student to be ready to try several approaches to
the same problem or to repeatedly reanalyze parts of a problem. Such
experience provides the student with illustrations of the sensitivity of
analytic results to the methods applied and with practice at the application
of similar techniques in widely differing circumstances.
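
The sensitivity of results to the method applied can be illustrated with a
minimal sketch. The Python code below is not part of the QMPM package or of
CMU-DAP; the batch of values and the helper trimmed_mean are hypothetical,
chosen only to show how one wild value moves the mean while a more robust
summary barely changes.

```python
from statistics import mean, median


def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    if len(ordered) <= 2 * trim:
        raise ValueError("not enough values to trim")
    return mean(ordered[trim:len(ordered) - trim])


# Hypothetical batch (illustration only): six ordinary values and one wild one.
batch = [4, 5, 5, 6, 7, 8, 60]

# Three summaries of the same batch, computed by three different methods.
summaries = {
    "mean": mean(batch),
    "median": median(batch),
    "trimmed_mean": trimmed_mean(batch),
}

for name, value in summaries.items():
    print(f"{name}: {value:.2f}")
```

Rerunning the sketch with the wild value removed or altered is the kind of
repeated reanalysis described above: the mean shifts noticeably, while the
median and the trimmed mean stay close to the bulk of the batch.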

Although frequent performance of analytic activities contributes to
learning, it can burden the student with an inordinate amount of
repetitious and boring hand work. The computer routine (CMU-DAP), available
for use with the QMPM package, obviates this activity by having the computer
perform the arithmetical operations. The system is designed so that
operations appear natural, i.e., no prior programming experience is necessary.
The routines are "called" in a language that is easily understood and
employed by novices. While the machine generates graphics and performs
computations, the student is free to concentrate on alternative analytic
strategies or the evaluation of analytic results. Note that while the
computing system enhances the learning experience, it is not an essential
feature. In particular, instructional material does not depend on its
availability. Other commercially available systems such as SPSS, IBM
STATPAK, etc., contain routines for performing many of the procedures
covered in QMPM and may be substituted for CMU-DAP. Also, new and planned
texts (e.g., McNeil, 1977, and Hoaglin and Velleman, in preparation)
provide code for exploratory techniques.

Goals 

QMPM is designed to facilitate the education of public managers and
administrators in contemporary data analysis, to provide them with skills
for understanding and criticizing analyses performed by others, and to
provide them with skills for presenting and interpreting technical material
to non-technical audiences. The pedagogic structure, topic organization,
and instructional material are designed with the objective of providing
students with a deep understanding of data analytic methods and assuring
that a high proportion of students will acquire mastery of data analytic

Module  and  Unit  Contents 

QMPM's curricular material is divided into four independent modules
which are further disaggregated into content units consisting of varying
numbers of 90 minute lectures.

Suggested Sequencing

A year-long course can be organized by either following the
specified sequence of modules and units or modifying this sequence to
fit the purposes of the instructor. The designed sequence is based on
a hierarchical development of skills for handling increasingly more
complicated data sets. Thus, in Module I, single batches of data
precede multiple batches, and in Module II regression with one carrier
precedes regression with multiple carriers. Note, though, that QMPM's
topic organization is non-traditional. The most dramatic deviation
from usual sequencing occurs in the presentation of regression as a
model fitting procedure before the presentation of probability notions.
The assumption here is that probability and inference are not essential
to the process of constructing models. Rather, they speak to the issue
of selecting best fitting models or estimating parameter values in
sampling situations. The logic behind this sequencing is discussed
in Leinhardt and Wasserman (1977). An alternative and more traditional
approach would place Module III, particularly units 5 and 7, before
Module II. Regression could then be covered either directly after
Module III or after Module IV. Unit 8 can occur anytime after Module III,
but Unit 9 should not precede Module III. A diagram of module and unit
dependence appears below.


Diagram  of  Module  and  Unit  Dependence 


Modules are circled and indicated by Roman numerals; Arabic numerals
refer to units. Solid lines indicate design dependence; dashed lines
indicate alternatives to the sequence implicit in the unit numbers.

Each module and each unit is a complete instructional package and
can, therefore, be taught independently of other QMPM components.
However, each does possess a set of prerequisites which are often covered
in preceding components. If mastery of these prerequisites is assured,
prior components need not be taught.

Since the package is modularized, components can be used to create
short courses focusing on specific topics or as part of in-service
training programs that offer a selection of topics. For example, a
short course on contemporary exploratory data analysis could be composed
of units 1, 2, 3, and 8. A short course on analysis of contingency
tables could be based on units 5, 6, 7, and 9. A short course on modern
data analytic graphics could be developed using units 1 and 2. A short
course in regression could be based on units 3, [illegible], 5, and 6.
A flow chart of these alternatives appears below. Other courses can be
conceived and interconnected with these suggested sequences.

Flow Chart of Alternative Unit Sequences
for Short or Specialized Courses

[Flow chart not reproduced. Sequences labeled in the chart: Exploratory
data analysis; Short QMPM; No regression, contingency tables; Traditional
approach.]

Lecture 5-0 Outline (I)
Lecture 5-0 Transparency Presentation Guide (I)
Lecture 5-0 Transparencies
Lecture 5-1 Outline (I)
Lecture 5-1 Transparency Presentation Guide (I)
Lecture 5-1 Transparencies (S)
Lecture 5-2 Outline (I)
Lecture 5-2 Transparency Presentation Guide (I)
Lecture 5-2 Transparencies (S)
Lecture 5-3 Outline (I)
Lecture 5-3 Transparency Presentation Guide (I)
Lecture 5-3 Transparencies (S)
Homework, Unit 5 (S)
Homework Solutions, Unit 5 (I)
Quiz, Unit 5 (I)
Quiz Solutions, Unit 5 (I)
Reading Assignments, Unit 6 (S)
Lecture 6-0 Outline (I)
Lecture 6-0 Transparency Presentation Guide (I)
Lecture 6-0 Transparencies (S)
Lecture 6-1 Outline (I)
Lecture 6-2 Outline (I)
Lecture 6-2 Transparency Presentation Guide (I)
Lecture 6-2 Transparencies (S)
Homework, Unit 6 (S)
Homework Solutions, Unit 6 (I)
Quiz, Unit 6 (I)
Quiz Solutions, Unit 6 (I)
Reading Assignments, Unit 7 (S)
Lecture 7-0 Outline (I)
Lecture 7-1 Outline (I)
Lecture 7-2 Outline (I)
Quiz, Unit 7 (I)
Quiz Solutions, Unit 7 (I)
Introduction to Module IV
Prerequisite Inventory, Units 8 and 9 (S)
Homework, Prerequisite Inventory, Units 8 and 9 (S)
Homework Solutions, Prerequisite Inventory, Units 8 and 9 (I)
Reading Assignments, Unit 8 (S)
Lecture 8-0 Outline (I)
Lecture 8-1 Outline (I)
Lecture 8-1 Transparency Presentation Guide (I)
Lecture 8-1 Transparencies (S)
Lecture 8-2 Outline (I)
Lecture 8-2 Transparency Presentation Guide (I)
Lecture 8-2 Transparencies (S)
Lecture 8-3 Outline (I)
Lecture 8-3 Transparency Presentation Guide (I)
Lecture 8-3 Transparencies (S)
Homework, Unit 8 (S)
Homework Solutions, Unit 8 (I)
Quiz, Unit 8 (I)
Quiz Solutions, Unit 8 (S)
Reading Assignments, Unit 9 (S)
Lecture 9-0 Outline (I)
Lecture 9-0 Transparency Presentation Guide (I)
Lecture 9-0 Transparencies (S)
Lecture 9-1 Outline (I)
Lecture 9-1 Transparencies (S)
Lecture 9-2 Outline (I)
Lecture 9-[?] Transparency Presentation Guide (I)
Lecture 9-[?] Transparencies (S)
Homework, Unit 9 (S)
Homework Solutions, Unit 9 (I)
Quiz, Unit 9 (I)
Quiz Solutions, Unit 9 (I)
Final Examination, Second Term (I)
Final Examination Solutions, Second Term (I)

Package Development

The package was developed at the School of Urban and
Public Affairs (SUPA) of Carnegie-Mellon University (CMU). SUPA
offers both doctoral and masters degree programs which emphasize
public sector professional activities and research. The school is
heavily committed to research and to innovations in teaching. The
educational staff at SUPA is quantitatively oriented and recognizes
the essential importance of sophisticated quantitative training at
all graduate levels.

At SUPA the need to develop skills for performing quantitative
studies and for presenting results in an informative and effective
fashion has always been recognized. Until 1975, satisfying this
need was viewed in the traditional manner of including a year's
sequence in introductory statistics, either through a course offered
in SUPA or through other statistics courses available at CMU.

In 1975 an attempt was made to break with the past. QMPM was
put together following the acquisition of information on data
analysis problems experienced by a broad range of practicing public
managers and administrators. From mailed questionnaires and
interviews with practitioners it became evident that both the approach
and content of traditional courses were suboptimal as far as the
needs of public sector professionals were concerned. As a consequence,
a course was developed that emphasized graphics, exploratory
procedures and robust analyses. In addition, topics such as survey
design, sampling methods and analysis of cross-classified data
were added while other, less relevant material was excised.

In the 1975-1976 academic year an experimental version of
QMPM was offered to approximately 20 first year masters students
at SUPA. Emphasizing application and based in a relevant empirical
context, the course proved to be an outstanding success. When
the NTDS/HUD curriculum development project was announced, QMPM
seemed to be a natural base for a proposal and it was ultimately
funded.

The course development activity took place between May 1976
and August 1977. Simultaneously, during the academic year, an
experimental version of the course based on the curricular material
developed under the NTDS/HUD subcontract was taught. In addition,
a short version of QMPM was taught as part of an in-service
training program for personnel in community mental health programs.
Feedback in the form of student opinions and outside evaluative
observation of student progress was used to revise the instructional
material. The final product has been tested under a variety
of situations and promises to provide a significant improvement
in the educational experience of public administrators and managers
in the area of data analysis and statistical methods. Leinhardt,
Leinhardt, and Wasserman (1977) report results of a quantitative
evaluation of the experimental 1976-1977 implementation.
Instructor's Manual

Introduction 


Quantitative Methods for Public Management contains instructional
materials to cover four major areas in data analysis: (I) Exploration
of batches of data; (II) Modeling continuous data using regression;
(III) Probability, sampling and inference; and (IV) Modeling
cross-classified data. Each module consists of lecture outlines, reading
assignments, examinations and exercises with solved problems, masters
for visuals, prerequisite inventories, and material for distribution
to students. The lecture outlines are to be used as presentation
guides for instructors. All other material is for student use.
Computer routines are also available for student use in performing
analyses of empirical data.

The QMPM modules are not self-contained instructional components.
Their use depends upon the availability of various commercially
distributed texts, readers, and similar resources. These resources are
detailed below and in the modules themselves.

All modules are similarly organized. The instructional strategy
used follows established educational theory. Prerequisite inventories
are employed to determine whether students possess knowledge of
various concepts and methods upon which mastery of a unit's substance
depends. Handout material and references provide students with
detailed and sufficient information on these prerequisites. Broad
based understanding of technical areas is avoided at this stage, with
focus instead on specific tools or ideas that are used in the units.
The units themselves contain introductory material in the form of
advance organizers which sensitize the student to topics and ideas
that will be covered in the unit. New material is then presented in an
instructional mode with general principles followed by examples of
applications or development. Visuals, in the form of overhead projector
transparencies, are used extensively. Students should have copies of
these slides in hand while the lecture proceeds. All examples are based
on empirical data descriptive of or relevant to public policy or
administration.

Students demonstrate learned skills in three situations:
homework, papers and quizzes. Homework is designed to present students
with problems to be solved in unpressured time periods. Solution of
specific, well-defined problems is at issue here. Papers provide longer
study periods and require demonstration of comprehension and
interpretation of an unstructured problem. Exams require students to operate
under pressure to solve relatively straightforward problems. Text
references and readings on empirical studies in which quantitative
analytic methods are applied to empirical policy issues provide students
with diverse examples of applications. Computer operations permit
students to participate personally in numerous analyses.

It is recommended that students write one 10 to 12 page paper at the
conclusion of each module. The topic should be selected by the student in
consultation with the instructor. The paper should contain a quantitative
analysis of a public policy issue and an extensive verbal discussion of
the study.

Objectives 

The goal of QMPM is to help students of public management and
administration master a diverse set of data analytic tools. Closely associated
with this goal is that of providing students with a critical sense of
what is a good and useful analysis and with skills to present relatively
complicated analyses to non-technical audiences in such a manner that
results and implications are effectively communicated. Paper assignments,
as discussed above, are essential to achieving this goal.

Instructor's  Role 

Quantitative Methods for Public Management (QMPM) is first and
foremost a course in data analysis and statistics. QMPM has been
designed under the assumption that students will not continue a course
of study in statistics beyond their experience with QMPM (although
QMPM does provide all essential material for continuation). Thus,
the material covered, the presentation process, empirical context and
instructional activities have been designed to achieve both a broad
introduction to quantitative methods and a deep, lasting learning of
analytic skills.

The role of the instructor in accomplishing these objectives is
critical. Because courses in quantitative methods are traditionally
thought of by students as "hard" courses and even irrelevant to their
main concerns, instructors of required quantitative methods courses
face a particularly difficult task. When the material covered is as
novel as that in QMPM, the behavior of the instructor becomes even
more central to success.

The instructor must possess self-assurance and be able to
demonstrate competence with the methods taught. Instructor familiarity
with the substance and procedures of QMPM is, thus, essential. Prior
to teaching QMPM the instructor should proceed through all of the
instructional material so that the essential features and
idiosyncrasies of the course are known. To a great extent, QMPM is an
attitude: an attitude towards data, their manipulation and analysis. For
successful transference of this attitude to students, instructors must
be able to demonstrate it in their own behavior, in their willingness
to pursue unorthodox analysis and to explore data in attempts to make
the data "talk." This attitude is not easily transmitted to students,
especially students who may possess only weak mathematical skills or
who have been taught that data are sacrosanct. Nonetheless, acquisition
of this attitude towards analysis and towards quantitative data may
be considered to be the primary behavioral objective of QMPM.

The instructor is expected to perform much in the manner of an
instructor in any traditional course. A lecture situation is assumed
in which the instructor presents material on a scheduled basis before
a group of students. The lecture outlines should be used by the
instructor as a guide in the preparation of a lecture. The instructor
should promote questioning by students and pursue general problems of
understanding in depth, but leave for private consultation an individual
student's problem when a brief response is unsatisfactory.

The instructor should construct many examples of the application
of QMPM procedures. These examples need not be elaborate but
should demonstrate how understanding of a policy or administrative
issue is improved by use of data analytic tools. The instructor should
aim to develop examples based on local situations or topics of current
national or international interest. Artificial examples, unless the point
they make cannot be covered in any other way, should be avoided.

The instructor should be available and responsive to student
inquiries outside of formally scheduled class periods. Students are
required to engage in numerous exercises and should be encouraged to
try alternative approaches to a given problem rather than seek the one
"correct" answer. Since this will inevitably lead some students
into situations which they do not have the knowledge to understand,
they should know that help is available. Remember, good positive
attitudes towards the performance of analysis are essential. Lack of
support from instructors sets a poor role model and discourages
students. Since courses in quantitative methods have historically suffered
from a poor image, instructors should act to compensate for student
insecurity.

Besides being available and supportive, instructors should provide
students with rapid feedback on exercises, paper assignments, and exam
performance. Because of the diversity of topics covered in QMPM, students
may be unable to use feedback information regarding a particular behavior
or skill if it comes long after demonstration. In addition, feedback
is most effective in learning if it follows rapidly on behavioral
action. When it does, students can adjust their understanding or
modify a behavior while the activity is fresh in their minds and,
possibly, demonstrate the correct behavior and have it confirmed in
another circumstance.

Instructor's Qualifications

The instructor is assumed to have experience teaching quantitative
methods or statistics at the graduate level. Experience at performing
empirical studies contributes to the instructor's ability to relate
abstract notions or methods to real life situations. It is not essential
for the instructor to be a statistician or mathematician. Nor is it
essential for the instructor to have extensive prior experience
with all the topics covered in QMPM. An instructor with knowledge of
classical statistics is advised to read carefully all text and reference
material cited in the package and, in particular, to read Tukey, J.W.,
Exploratory Data Analysis, Addison-Wesley, 1977; Mosteller, F.M. and
J.W. Tukey, Data Analysis and Regression: A Second Course in Statistics,
Addison-Wesley, 1977; Bishop, Y., S. Fienberg and P.W. Holland, Discrete
Multivariate Analysis, MIT Press, 1975; S. Fienberg, The Analysis of
Cross-Classified Categorical Data, MIT Press, in process; McNeil, D.R.,
Interactive Data Analysis, Wiley, 1977; and Erickson, B.H. and
T.A. Nosanchuk, Understanding Data, McGraw-Hill, 1977.

Staff  Support 

QMPM can be taught by a single instructor. However, with a sizable
class (10 or larger) the need for rapid feedback and availability may
infringe upon an instructor's other responsibilities. In such situations
it is highly advisable to have teaching assistants available. These
individuals should have regular hours in which students can have access
to them and should take responsibility for grading homework exercises
and quizzes. Since QMPM is supplied with worked problems for exercises
and exams, performance of these activities by teaching assistants should
pose no difficulties.

If the computer routines supplied with QMPM are employed, then a
staff member should take responsibility for interacting with students
regarding their usage. The system that is provided has been extensively
tested and debugged and, therefore, should need no software work beyond
that required for mounting on the local computer. However, students
unfamiliar with computer software packages may become unnecessarily
frustrated by their own lack of knowledge about the system. This can
be relieved by assigning either a teaching assistant or another staff
member the responsibility of becoming adept at using the system and
relying on that person to act as an in-house systems consultant. Data
acquisition and mounting might be handled by the same person. It
is suggested that a large data library should be acquired and left
open to student exploration.

Technical  Resources 

Because of the uniqueness of QMPM no single textbook is fully
satisfactory. Indeed, even combinations of texts fail to provide
students with complete material for studying some topics covered in
the course. For this reason it is advisable to provide students with
alternative means for reviewing the contents of a lecture. Instructors
might wish to reproduce copies of lecture outlines so students will
have a topic outline of covered material. Of particular utility here,
however, is the use of video taping equipment. If such resources are
available, then lectures should be taped and a tape library of the
course constructed which students can exploit at any time. Such
devices have been highly regarded by students when employed in
experimental implementations of QMPM. A technical requirement for such
equipment is the ability to resolve small characters when written
on a blackboard.

QMPM is designed to be taught by an instructor in a traditional
lecture format. A hall or room which possesses ample blackboard space
is required. The instructor should feel free to write examples on the
board, draw figures, and otherwise illustrate material as the need
arises. If video taping equipment is used, the room should have
sufficient lighting to permit high contrast resolution of material
written on the blackboard. Since the use of overhead transparencies
is assumed, the room should be provided with a projection screen and
a location for the projector. QMPM comes with dense paper masters of
transparencies. These should be reproduced onto plastic slides by
the instructor through use of appropriate equipment. The instructor
should distribute paper copies of these transparencies to students
before a lecture and should also assure that reproductions of other
hand-out material are available on a timely basis. Since references
to contemporary texts and articles occur in both student and instructor
material, the availability of a library is advantageous.

If the computer routines are to be used, then the routines should
be mounted on a computer before the course commences. While the system
has been adequately debugged, there are likely to be local machine
idiosyncrasies that must be overcome for efficient operation. Some
software may have to be written at the implementation site as a
consequence. The computer system is designed as an interactive system.
A time-shared computer and hard-wired or acoustically coupled printing
terminals are suggested. CRT's are not recommended. While the system
can be operated in a batch processed mode, its educational utility is
maximized when it is operated interactively.

Use of Instructional Materials

The primary curricular components contained in the modules are
lecture outlines, one for every 90 minute lecture in the course. These
are organized with a "zero" lecture containing advance organizers for
students followed by lectures containing substantive presentations.
Instructors are expected to use these components as topic
guides and are not expected to adhere to them absolutely. Both
the level of preparation of students and the nature of the
implementation should condition the actual presentation of material.
Similarly, visuals in the form of overhead projection
transparencies, homework problems, examples, and test material are
provided as guides. While those delivered in the QMPM package can be
used as they stand, the instructor should make an effort to construct
comparable examples and problems which are relevant to the specific
time and place of the implementation. In addition it is the
responsibility of the instructor to see that copies of material to be
used by students and copies of transparencies are prepared and
distributed.

The expected usage is as follows (recalling that each module
is organized in units): A prerequisite inventory containing
material whose comprehension is required for mastery of QMPM unit
topics is distributed. Homework problems on prerequisite material
(which can be taken home or done under in-class test conditions)
follow. Solutions to these problems are given to students after
they have attempted to solve the problems. Difficulties with
prerequisite inventory problems should be resolved by the student and
confirmed by the instructional staff prior to exposure to new material.

A lecture N.0 (where N indicates the unit) precedes every unit. This
lecture (which is discretionary) contains advance organizers to focus
the student's attention on topics that will be covered in the unit.
The more complex the material covered in the unit the more important
are the advance organizers. The instructor uses the lecture outline
in this and every case as a guide, modifying and adjusting the
presentation as style and context dictate.

The material presumes a lecture, i.e., an instructor standing before
an audience and making an oral presentation. Each lecture is 90 minutes
in duration (which may be organized into one or two class sessions).
Presentation aids that are also assumed are a blackboard with space
sufficient for copious drawings and writing equations, an overhead
projector and screen, and duplication facilities for producing handouts
prior to a lecture. Suggested presentation sequences for transparencies
are keyed by number in the lecture outline (numbers in brackets on
righthand side) and summarized in a transparency guide.

Following presentation of the unit's introductory lecture a student
reading assignment is distributed. The number of lectures for each unit
depends upon the unit's contents. Following the substantive lectures
instructors should provide a review lecture, although such lectures are
discretionary. Classes of advanced or experienced students may not need
review while slower students will find reviews critical to complete mastery.

Homework problems and quiz material with worked solutions follow
each unit's lecture. The homework schedule should assure rapid return
of graded and corrected problems. Students should require no more
than one week to hand in homework and should receive corrected
homework in two to three days. Students failing to comply with homework
requirements or who consistently hand in erroneous problem sets should
be singled out for remedial help. A unit quiz should be conducted
in a classroom under examination conditions. Students should be permitted
the use of hand-held calculators.

This entire procedure is repeated for each unit. A week of class
should normally include three hours of instruction with one to two
additional hours available for workshop and review (with instructional
staff). Workshops are also used for the presentation of special material
(such as "Some Principles of Graphics for Tables and Charts" in Module I).
The instructor is responsible for organizing and facilitating the
functioning of the course. The instructor is also expected to elaborate
on the policy relevant nature of examples, problem sets, and outside
readings. These elaborations should be made during normal lecture
presentations and through handouts of worked problems derived from
local (i.e., the locale where the implementation occurs) situations.

Workshops should include extensive discussions of applications.
Students are expected to attend lectures, complete homework problems,
and take quizzes and the final examination. In addition, students
should be required to produce two papers (10 to 20 pages in length)
within a semester's time in which QMPM techniques have been applied
to a policy or public management problem of the student's own choosing.
These requirements allow the student to perform data analysis in three
types of situations: homework provides structured problems with lax
time constraints; quizzes provide structured problems with tight time
constraints; papers provide unstructured problems with extended time
constraints. In all cases grading should be based on the effective
solution of the problem and its interpretation by the student. Total
student preparation effort should range from two to four hours per
lecture.


Use of the Computer

QMPM is provided with a computer package containing a set
of analytic routines and a data library; these are meant to be
used by the student in exercising learned skills. Students
are expected to use the system in doing homework problems and
writing papers. While not essential to the successful
implementation of QMPM, the computer system does provide students with
opportunities for performing elaborate studies and for carrying
out numerous analyses where one study would otherwise be
considered sufficient. Since empirical data are idiosyncratic the
effective analyst and critic of analyses should have extensive
"hands-on" experience with empirical studies. Such experience
often is a consequence of a long career. Students of QMPM,
however, by using the computing system, can develop such experience
while they participate in the course. Homework problems that
require computer assistance are indicated. Alternative software
which permits students to perform necessary computations on a
machine may be used in place of the routines in the QMPM package.


Audience 

QMPM is designed as an entry level masters course of one
year duration. Students in such a class are expected to have
successfully completed college mathematics courses so that they
are proficient in algebra. While some matrix algebra and
calculus are used, the prerequisite inventories and handouts supplied
with QMPM provide sufficient coverage of these tools. No knowledge
of statistics is assumed nor is any experience with computers
or programming required.

Because of its unitized-modular structure a variety of
courses other than a one year sequence may be generated from the
QMPM package. At the graduate school level these may take the
form of short courses on specific topics or one semester courses
containing selected modules. In-service training programs may
also be developed. In each case student prerequisites are the same
as for a one year course save that certain advanced units in the
package build upon material covered in other units.

Introduction

Quantitative Methods for Public Management (QMPM) is a course
of instruction in data analysis and statistics for students of
public management and administration. The course is designed to
teach you how to perform and criticize data analyses and how
to interpret and present analytic results for effective
communication to non-technical audiences. The course was developed at
the School of Urban and Public Affairs of Carnegie-Mellon University
as part of a curriculum development project funded by the Federal
Department of Housing and Urban Development.

QMPM is structured into four modules which cover a diverse
set of quantitative analytic methods. Because of its modular
structure your instructor has the option of presenting all of the
material, in which case a year long course of study is assumed, or
selecting components for shorter periods of instruction. The
topics covered have been selected specifically for their utility
in policy and administrative studies. Contemporary educational
theory has been used throughout to assure that you will have the
greatest chance of mastering the material and developing a deep
understanding of principles. In addition, to improve relevance of the
course, all examples, exercises and examinations are based on
empirical data that derive from or are relevant to public policy
and administrative issues.

QMPM is designed as an entry level year long masters course.
However, its modular structure permits it to be used in shorter
course sequences and in in-service training programs. The student
is assumed to have successfully mastered a college mathematics course.
In some instances mathematical skills beyond this level are required.
In these cases a prerequisite inventory test will be administered
by your instructor and material will be provided to aid you in
acquiring mastery of the necessary concepts and tools.

Instructional Organization

QMPM is designed as a lecture course. Your instructor will
prepare presentations based on the instructional material in the
package. Each unit of material will be preceded by a presentation
in which the instructor will describe the objectives of the unit,
the types of skills you will learn and the nature of the problems
these skills will enable you to solve. These advance organizers will
help focus your expectations about the unit and will enhance your
receptivity when new information is provided.

You should prepare for each lecture by reading the text and
article assignments before class. These refer to textbook discussions
and application examples. Endeavor to become as familiar as possible
with each new idea you encounter so that each lecture will be more
readily understood. You should expect to spend between two and four
hours preparing for each lecture.

Prior to each lecture you will receive a handout of
reproductions of any visuals your instructor will use in the presentation.


Student's  Manual 


This will enable you to refer to visuals during the lecture which
are not at that moment being projected. The lecture is keyed to
these displays, which provide examples and figures illustrating
concepts and methods covered in the lecture.

During a lecture you should feel free to raise questions
concerning material being presented. It is important that you feel
that you understand the procedures taught and can apply them in
other contexts.

Following each lecture you should use your notes and copies of
the visuals to review the lecture's substance. If the lecture has
been video taped you should use this resource to review aspects of
the lecture that may seem more difficult than others.

Homework exercises will be distributed each week. These should
be attempted as soon as the topics they refer to have been covered
in class. It is absolutely critical that you do the homework.
Data analysis is a skill which can only be mastered through use.
The homework gives you an opportunity to exercise and perfect your
newly learned skills by exploring the kind of data that you are
likely to encounter in your professional career. You should try to
become facile at organizing, analyzing, and interpreting these data.
QMPM is designed to maximize the benefit you will derive from
learning quantitative methods, but there is no substitute for
extensive experience.

In doing your homework and in taking examinations you will find
a hand calculator to be invaluable. These devices are relatively
inexpensive and one should be purchased at the beginning of the course.
You will need a machine that has the four arithmetic functions,
logarithms and exponentiation. A memory is advantageous but not
necessary.

The QMPM package includes a computer system. If this system
is available in your course you should learn to use it early.
It will help your learning experience by removing the drudgery of
repetitious arithmetical operations from your exercises and permit
you to concentrate on analytic strategy and on replicating similar
analyses on different data sets. Thus, you will be able to amass
more experience with data analysis than if you had to rely on hand
calculations. Some exercises supplied with the course are meant
to be done on a computer. These will be so indicated.

QMPM is supplied with worked exercises and exam problems. Thus,
you can expect rapid feedback from your instructor if you complete
your assignments promptly. When feedback on new skills occurs
shortly after demonstration of the skill, the learning process is
more efficient and effective. This is particularly important in
the case of a technical course where new skills build on older
ones. There is a cumulative process involved which will be short
circuited if you fall behind significantly.

As you progress in QMPM you will discover that other courses
that you may be taking will become easier. QMPM covers basic notions
and methods in data analysis and statistics. The procedures you will
learn in QMPM are used throughout the social and policy sciences,
and thus articles or textbooks you may read in other courses can be
expected to make use of them. Consequently, successful mastery of
QMPM will enhance the success of your entire program of study.

General Introduction to Quantitative Methods
for Public Management (QMPM)

Many of you are wondering why students of public management are
required to study quantitative methods. The reason is this: Public
management involves decision making, and making effective decisions
requires careful evaluation of information. Today, information of
relevance to public managers often comes in quantitative form. Its
evaluation requires an operational knowledge of analytic methods.
This course is designed to provide this knowledge by teaching you
data analytic skills that will enhance your ability to gather
quantitative data, operate on them, and use them to make better, more
effective decisions. The course's curriculum has been carefully
designed to provide you with a variety of tools which will cover most
of the data analytic problems you will encounter as practitioners. To
help make the course representative of reality it contains an
extensive library of real data, the same kind of data that operating
public managers use, so that your learning experiences will come as close
as possible to the realities of public management. You will be asked
to exercise your new skills on these data as you progress. To help
you acquire these skills there exists an elaborate support system
composed of pedagogic procedures, personnel and audio-visual equipment.
The system will be described in this introduction and in a class
presentation.

QMPM is a new course. One might even say that it is a
revolutionary course. It is revolutionary in that it breaks with traditional
approaches to teaching quantitative methods in both its pedagogy and
its substance. Pedagogically, it emphasizes mastery learning concepts,
the organization of topics into inferred learning hierarchies with
clearly specified skill prerequisites. You will need to know the
prerequisite material before you can proceed through a topic. These
prerequisites include all but the most basic skills that you will need
to succeed. There are no hidden assumptions; no special knowledge is
required to master a given topic other than what is clearly spelled out
in the prerequisite inventories that precede each of the four major
sections or modules composing the course. Facilities have been
provided and time will be set aside to help you master these prerequisites
should they be unfamiliar to you.

Mastery learning also means that you will not be graded on a curve
or normed. There is nothing in this course that is too difficult for
any of you to master. If you all master the material, you will all
pass high. If problems arise because of a lack of comprehension or
understanding, numerous resources exist to help you locate the specific
difficulty and obtain ultimate mastery of the skill. In general, you
should feel assured that every effort will be made to help you master
a skill before pushing ahead to new material.

The other new aspect of this course rests in the selection of
topics to be covered. QMPM is not a course in statistics. While some
topics will be covered that are discussed in traditional statistics
courses, they are approached from a pragmatic rather than a
theoretical point of view. The theory discussed will be just sufficient to
insure comprehension of particular skills and awareness of their
limitations. The emphasis will be on doing analysis rather than studying
analysis.

Although QMPM is not a statistics course in the traditional sense,
it is a course in the analysis of quantitative data. In addition to
some traditional methods of statistical analysis, a variety of new tools
and analytic methods which were pioneered in the late 1960's by John
W. Tukey, a statistician at Princeton University and Bell Laboratories,
will be covered. The new methods which Tukey and others have been
developing emphasize the exploratory nature of data analysis, the
"detective" work that precedes the traditional inferential stage of
confirmatory statistics. In exploratory data analysis (EDA), the
analyst first organizes the data to understand what kinds of questions
can be answered by them and what kinds of operations must precede the
application of confirmatory or inferential procedures.

Exploratory data analysis possesses several features that are
especially useful to public managers. First, it relies heavily on the
use of graphic displays as analytic tools. Traditionally, displays
have been used as final summaries presented only after an analysis was
completed. In QMPM, however, graphics are used as integral parts of
the analytic process, so that they may provide critical information
about the data and the process of the analysis. The graphics used in
QMPM are relatively simple and easily learned. Indeed, once the
graphical methods are introduced you will discover that they have a
kind of "face validity": what they mean is obvious from the way they
appear. The face validity of EDA graphics will be a great advantage
to you in your professional career. The graphics that you will learn
to use can be presented to non-technical audiences and will probably
be understood with only minimal explanation. Consequently, rather
complicated notions or analytic results can be communicated during
presentations to individuals who have a wide variety of backgrounds, a
situation you will undoubtedly encounter often in your professional
careers.

The second feature of EDA which will enhance its utility to you
is the use of resistant or robust procedures. These are procedures
which yield results which are relatively unaffected by occasional
missing or incorrectly recorded data values or incompletely specified
models. Most traditional confirmatory statistical procedures are not
resistant in that they are easily influenced by a few widely divergent
data values, nor are they robust in that the misspecification of a
model can yield invalid results. But public managers often must rely
on data of less than highest quality, data collected for other purposes,
and models that neglect some variables. EDA procedures are particularly
helpful in such situations. The resistant and robust qualities of the
procedures covered in QMPM are so important that without them in many
situations an investigator would not be able to conduct a thorough
study.
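
The idea of resistance can be illustrated with a small sketch in
Python. This is not part of the QMPM package or of CMU-DAP; the batch
of values below is invented for illustration, and the comparison of
the mean with the median is simply one familiar instance of a
non-resistant and a resistant summary.

```python
from statistics import mean, median

# Hypothetical batch (invented for illustration): seven recorded values.
clean = [4, 5, 5, 6, 7, 7, 8]

# The same batch with one value incorrectly recorded (7 keyed in as 70).
dirty = [4, 5, 5, 6, 70, 7, 8]


def summarize(batch):
    """Return a non-resistant summary (mean) and a resistant one (median)."""
    return {"mean": mean(batch), "median": median(batch)}


clean_summary = summarize(clean)
dirty_summary = summarize(dirty)

# The single bad value drags the mean far from the bulk of the data,
# while the median does not move at all.
print("clean:", clean_summary)
print("dirty:", dirty_summary)
```

In this invented batch the one mis-recorded value more than doubles
the mean but leaves the median unchanged, which is the sense in which
a summary is called resistant.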

Many of the techniques you will learn in this course, both
exploratory and confirmatory, are among the newest in the field of data
analysis. Learning such up-to-date skills will put you on the "cutting
edge" of the field. The newness of these techniques, however, does
present some difficulties as far as communicating with others whose
training in quantitative methods occurred some years ago. You will
be learning procedures which have only recently been made available
to the general public. Most texts you will use were published as QMPM
was developed. Not many technical data analysis or public management
practitioners are familiar with them. Thus, you should expect to find
yourself frequently explaining what you have done, even to individuals
whom you might normally believe were familiar with data analytic
procedures.

Even though many of the techniques we will cover are relatively
straightforward, some are complex. Moreover, even simple procedures
performed on large data sets can be extremely time consuming when
done by hand. Consequently, a computer system (CMU-DAP for
Carnegie-Mellon University-Data Analysis Package) exists to facilitate doing
analysis. (This is an optional part of the QMPM package.) The system
permits data entry, manipulation, and analysis in a simplified format.
Doing data analysis on a time-shared computer will facilitate your
mastery of analytic skills by allowing you to try many different
approaches to the same problem. Thus, you will be able to gain wide
experience in applying your skills without fear that you will have to
invest an inordinate amount of time on arithmetic operations. By
having the "grundge" work of data analysis performed by computer you
should be free to concentrate on planning and interpreting your
analysis and on exploring alternative approaches. If it is available
in your course, you will be introduced to CMU-DAP during the second
week of QMPM in a special three hour session, and will be expected to
use it for both homework and paper assignments.

The library of real data that has been prepared for your use
has already been mentioned. Typically, the data analyzed in traditional
statistics courses are fabricated for the purpose of illustrating a
particular technique. More often than not, such data are very
unrealistic, not the type which you would actually confront in the "real
world". Consequently, students, when they leave school and engage in
data analysis in the field, often find that their training has not
prepared them for the vagaries of reality. In QMPM these situations
are avoided. QMPM stresses the analysis of real data, data gathered
from practitioners, faculty, students, and published sources. This
collection has been organized into a computer based DataBank that can
be accessed with CMU-DAP. In addition to choosing data from the
DataBank for analysis throughout the course, you will also be expected
to gather real data and analyze it.

In summary, QMPM may very well be the most important course that
you take in graduate school. You are to participate in a
revolutionary approach to quantitative methods; you are co-conspirators in an
attempt to make data analysis relevant and useful to public
management.

Introduction to Module I

Overview 

Module I of the Quantitative Methods for Public Management
package contains two units, numbers 1 and 2. Unit 1, Single Batches of
Data, introduces the student to the notion of a data batch, a fit,
and an effect. It focuses on the organization, condensation, and
analysis of simple situations, single batches. The general objective
is to familiarize students with data and elementary models and to
provide students with a set of basic tools for summarizing, displaying,
and working with data. Single batches, essentially a single set of
observations on one variable, are considered in depth. The tools
introduced include classical procedures such as histograms, sorts,
means, and standard deviations. But the emphasis is on tools of
exploratory data analysis such as stem-and-leaf displays, order
statistics and transformation procedures. The definition and features of
a well-behaved or Gaussian batch are also considered, and a special
section discusses the features of good graphics and charts.

In Unit 2, Multiple Batches of Data-Unordered, the student is
introduced to the more complicated situation in which more than one
distinct set of observations exists. The tools introduced in Unit 1
are used in Unit 2 to facilitate comparison of effects among batches.
Since differences in spread among the batches can confound the
determination of differences in level, a procedure for finding a
transformation that equalizes spread is introduced. This procedure prepares
students for variance stabilizing transformations, introduced in a
later unit in the context of multiple regression.

Specific Objectives

Unit 1

Upon successful completion of Unit 1 a student will be able to
organize a batch of data using simple sorts, stem-and-leaf displays,
and histograms. The student will be able to describe the batch of
values using various computed summary numbers and to display the
summary by constructing a schematic plot. In addition, the student will
be able to determine if a symmetrizing transformation would facilitate
contrasting the batch with a well behaved batch and will know how to
determine a good transformation for this purpose. The student will know how
to recognize a well behaved batch and use and evaluate classical summary
statistics in their description. The student will also have a critical
appreciation for effective graphic and tabular displays and be able
to construct uncluttered, informative charts containing quantitative
facts.

Unit 2

Upon successful completion of Unit 2 a student will be able to
recognize a set of non-ordered multiple batches and use parallel
stem-and-leaf displays and parallel schematic plots to compare the
batches to one another. To improve the effectiveness of comparison
when spreads in the individual batches vary greatly, the student will
know how to use median by midspread plots to find a spread stabilizing
transformation. The student would then proceed to perform an analysis
on the transformed data.

Prerequisite Inventory
Units 1 and 2

Units 1 and 2 of Module I focus on the analysis of single and
multiple batches of data. Prior to the presentation of the material
in these two units, we shall discuss several elementary concepts.
The mastery of these concepts is an essential prerequisite to mastery
of the skills taught in Units 1 and 2. Before proceeding to Unit 1,
you should assure yourself that you are familiar with these basic
concepts.

The  inventory  is  divided  into  the  following  five  sections: 

1.  Numbers — Properties  and  Representation 

2.  Data  Vectors — Observations,  Subscripts,  Indexing,  Summations 

3.  Data  Sets — Variables  and  Various  Transformations 

4.  Percentages

5.  Plots and Graph Paper

Additional references to these topics appear at the end of this
inventory. Specific topics in these five areas will be reviewed in
class only if the average performance of the class indicates that
such discussion is necessary. If areas that you are weak in are not
covered in class, you should consult a member of the course's teaching
staff to determine how best to achieve mastery.

Section  1.    Numbers — Properties  and  Representation 

Throughout  this  course  numbers  are  used.     Consequently,  the  more 
important  properties  of  the  number  system  need  to  be  reviewed.  These 
properties are discussed in Chapter 1 of Rosenbach et al. (see the end
of  this  inventory  for  full  reference). 

It is assumed that all numbers worked with in this course belong
to the set, or collection, R of real numbers. Real numbers are those
which can be represented by terminating or nonterminating decimals.
Included in R are those numbers without decimal places, the integers.
An integer may be positive, 1, 2, 3, ..., negative, -1, -2, -3, ..., or
zero, 0.

When writing down single numbers, you should take the time to
record all the digits (including all the decimal places) in the number
to convey as much information as the number allows. The digits of
accuracy required in writing a number are called the significant
digits or significant figures of the number. In general, the number
of significant digits of a number equals the number of digits of
accuracy that the measuring instrument allows. If inches are recorded
with a ruler marked with tenths of inches, then the first decimal place
of the recorded numbers will always be a significant digit. For
example, 16.0 inches has 3, and not 1, significant digits. The population
of New England in 1790 is another example. In this year it was
1,009,408 persons. This number has 7 significant figures. If one
chooses not to record all the digits of a number, the quantity of
significant figures is reduced. If the 1790 population of New England
had been approximated by 1,009,000, a number with only 4 significant
digits, 3 significant digits would have been lost. It is important to
note how "fine" the scale of the measuring instrument is, since this
knowledge is essential in determining the total number of significant
digits of the recorded numbers.

Occasionally, numbers are expressed in a manner which draws
attention to the significant figures of the number. This can be done
by writing the number as a product of an integer power of 10 and a
number between 1 and 10, that is, a number with one digit to the
left of the decimal point. This method of recording numbers is called
scientific notation. As an example, the 1970 population of the United
States, 203,211,926 persons, can be written as 2.03211926 x 100,000,000
(a hundred million) or $2.03211926 \times 10^8$ (9 significant digits). We
may wish to approximate this number emphasizing only the number of
millions as 203,000,000 or $2.03 \times 10^8$ (3 significant digits). Table
1 shows various powers of 10, both positive and negative, that you
should be acquainted with.

[Table 1 here.]

Once a number is recorded in scientific notation, the number of
significant figures of the number equals one more than the number of
decimal places, and the correct power, or exponent, of 10 determines
the magnitude of the number. Hence, the 1970 population of the United
States has a magnitude of 8.
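
The relationship between decimal places, significant figures, and magnitude can be sketched in Python. This is only an illustration and not part of CMU-DAP: the helper names to_scientific and magnitude are hypothetical, and the sketch relies on Python's built-in "e" format, which writes the power of 10 after the letter e.

```python
import math

def to_scientific(value, sig_figs):
    """Write value in scientific notation with sig_figs significant figures."""
    # One digit sits left of the decimal point, so sig_figs - 1 decimals remain.
    return f"{value:.{sig_figs - 1}e}"

def magnitude(value):
    """Return the power of 10 used when value is written in scientific notation."""
    return math.floor(math.log10(abs(value)))

population_1970 = 203211926  # 1970 U.S. population quoted in the text

print(to_scientific(population_1970, 9))  # all 9 significant figures
print(to_scientific(population_1970, 3))  # emphasizing millions only
print(magnitude(population_1970))         # the exponent of 10
```

Asking for 3 significant figures leaves 2 decimal places, which is the "one more than the number of decimal places" relationship stated above.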

TABLE 1
Powers of 10

Power   Number

 -6     $10^{-6}$ = one millionth
 -5     $10^{-5}$ = one hundred thousandth
 -4     $10^{-4}$ = one ten thousandth
 -3     $10^{-3}$ = one thousandth
 -2     $10^{-2}$ = one hundredth
 -1     $10^{-1}$ = one tenth
  0     $10^{0}$ = one
  1     $10^{1}$ = ten
  2     $10^{2}$ = one hundred
  3     $10^{3}$ = one thousand
  4     $10^{4}$ = ten thousand
  5     $10^{5}$ = one hundred thousand
  6     $10^{6}$ = one million
  9     $10^{9}$ = one billion
 12     $10^{12}$ = one trillion

Occasionally one may wish to record a number with fewer than its
usable number of significant figures. This technique is called
rounding. It saves time and increases comprehension when more than a
few numbers are to be examined. The 1790 New England population may be
rounded to 1,009,000 persons (4 significant figures) or even 1,000,000
persons (only 1 significant figure). Digits are always rounded to the
nearest number, with 0, 1, 2, 3, and 4 rounded down, and 5, 6, 7, 8, and 9
rounded up. (Note the 5-5 split of the digits.) Thus:

19.1 → 19
17.7 → 18
16.5 → 17.

Tukey, in his text Exploratory Data Analysis, suggests rounding
numbers whose last digit is 5 to the nearest even number. Thus,

16.5 → 16
17.5 → 18

We  recommend  the  former  convention. 

Occasionally  it  is  convenient  to  reduce  the  number  of  significant 
figures  by  just  dropping  off  the  unnecessary  digits.     This  is  called 
cutting  and  is  quicker  and  easier  than  rounding.     In  QMPM  cutting  is 
used  in  certain  instances,  although  when  accuracy  is  desired  rounding 
is generally preferred. When the decimal portion of a number is dropped
and only its integer component is recorded, the operation is called
truncating. Rounding, cutting, and truncating are discussed in Chapter
1, pages 3-5, of Tukey (1977).
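
The three operations can be compared side by side in a short Python sketch. It is an illustration under stated assumptions rather than a CMU-DAP procedure: the function names are hypothetical, the standard decimal module is used so that a value such as 16.5 is handled exactly, and negative ties are assumed to round away from zero, a case the text does not address.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_half_up(x):
    """Round to an integer; a trailing 5 always rounds up (the convention recommended here)."""
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def round_half_even(x):
    """Round to an integer; a trailing 5 goes to the nearest even number (Tukey's suggestion)."""
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

def truncate(x):
    """Drop the decimal portion and keep only the integer component."""
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_DOWN))

def cut(x, sig_figs):
    """Keep the first sig_figs significant digits and drop the rest without rounding."""
    d = Decimal(str(x))
    # adjusted() is the power of 10 of the leading digit.
    exponent = d.adjusted() - (sig_figs - 1)
    return d.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_DOWN)

for value in (19.1, 17.7, 16.5, 17.5):
    print(value, round_half_up(value), round_half_even(value), truncate(value))

print(cut(47569.532, 3))
```

The two rounding conventions disagree only when the dropped part is exactly 5, as with 16.5.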

Section  2.     Data  Vectors — Observations,  Subscripts,  Indexing,  Summation 

A batch of numbers is a set of similar numbers, obtained in some
consistent fashion. Simple examples of a batch are: 1) Average
family incomes for each of Pittsburgh's 186 census tracts; 2) Population
of New York State for each year between 1900 and 1970, inclusive;
3) Distance traveled from home to school by each student in this class.
The expression data vector is used as a synonym for batch. A specific
datum, or batch value, is an observation or an element of the data
vector. Hence, the batch of family incomes for Pittsburgh census
tracts has 186 total observations.

It is convenient to have a mathematical representation for a
batch of numbers and the observations in the batch. In QMPM a capital
letter, such as X, is used to denote an entire batch of numbers. Each
individual observation is identified by attaching a number, written
below and on the right of this letter. For example, the first
observation in the batch X is denoted $X_1$, the second observation is $X_2$, etc.
The ith element is denoted $X_i$. Small numbers attached to X that
identify different individual observations are called subscripts. Thus, a
batch of 10 numbers, denoted X, can be written $X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8, X_9$,
and $X_{10}$. The subscripts are the integers running sequentially from 1
to 10. A more abbreviated representation of this batch is $X_i$, i =
1,2,3,4,5,6,7,8,9,10, or simply $X_i$, i = 1,2,...,10. In this last form, i
is called the index, which in this example runs from 1 to 10. The
sequence of periods is an abbreviation of the phrase 'and so forth'.
The capital letter 'N' is used to stand for the total number of
observations in the batch.

Another special notation is used to denote the sum of a batch of
numbers. The notation $\sum_{i=k}^{n} X_i$ indicates that the sum
$X_k + X_{k+1} + \cdots + X_n$ is to be formed; i=k indicates that the
summation is to begin with the kth element of the data vector, and n
indicates that the summation is to end with the nth element. The symbol
$\Sigma$, the Greek capital letter sigma, by convention, denotes that a
summation is to be performed. The letter i is the index and the summation
ranges over the values k to n. Listed below are some rules for summations:

2) $\sum_{i=k}^{n} X_i^2 = X_k^2 + X_{k+1}^2 + \cdots + X_{n-1}^2 + X_n^2$

3) $\sum_{i=k}^{n} a = (n - k + 1) a$

4) $\sum_{i=k}^{n} a X_i = a \sum_{i=k}^{n} X_i$

For example, Table 2 is a batch of numbers corresponding to United
States spacecraft launchings per year for the years 1957 to 1964.

[Table 2 here.]

Let X denote this batch. Thus, $X_1$ corresponds to 1, $X_2$ to 17, ...,
and $X_8$ to 81. Summing the numbers in X,

$\sum_{i=1}^{8} X_i = X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7 + X_8$
$= 1 + 17 + 21 + 31 + 49 + 71 + 71 + 81 = 342$

Thus, a total of 342 spacecraft were launched by the U.S. between the
years 1957 and 1964, inclusive (i.e., including the two 'end' years,
1957 and 1964). As an exercise, you should verify that
$\sum_{i=1}^{8} X_i^2 = 20736$.

Chapter 14 of Rosenbach et al. (1963) and Appendix A of Hays
(1973) discuss summations in greater detail, with some examples and
problems.
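
The subscript and summation notation of this section can be mirrored in a few lines of Python, using the Table 2 launchings as the batch. The sketch is illustrative only: batch_sum is a hypothetical helper, and because Python lists count from 0, the observation with subscript i is stored at position i - 1.

```python
# Table 2 batch: U.S. spacecraft launchings, 1957 through 1964.
launchings = [1, 17, 21, 31, 49, 71, 71, 81]

def batch_sum(x, k, n, power=1):
    """Sum x_i ** power for subscripts i = k, ..., n (1-based, inclusive)."""
    # Python lists are 0-based, so subscript i lives at position i - 1.
    return sum(value ** power for value in x[k - 1:n])

N = len(launchings)                       # total number of observations
total = batch_sum(launchings, 1, N)       # sum of X_i for i = 1, ..., 8
total_of_squares = batch_sum(launchings, 1, N, power=2)

# Summing a constant a over i = k, ..., n gives (n - k + 1) * a.
a, k, n = 5, 3, 8
constant_sum = sum(a for _ in range(k, n + 1))

# A constant multiplier can be moved outside the summation sign.
scaled_sum = sum(a * value for value in launchings[k - 1:n])

print(total, total_of_squares)  # prints: 342 20736
```

The last two quantities check the rules for a constant term and a constant multiplier on one arbitrary choice of a, k, and n.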


TABLE 2
U.S. Spacecraft Launchings

Year    Number of Launchings

1957     1
1958    17
1959    21
1960    31
1961    49
1962    71
1963    71
1964    81




Section 3.    Data Sets--Variables and Various Transformations

A single batch, or a collection of related batches that are to be
analyzed together, is called a data set. Data analysis is concerned
with exploring and understanding data sets. The field of statistics
encompasses both data analysis and the study of variables, especially
random variables or variables with associated probabilities. A variable
is a quantity that may assume any one of a set of values. Population
of census tracts, number of homicides in a police precinct, and yearly
incomes for professors in a major university are examples of variables.
A more 'formal' definition of a batch of data is a set of realizations
of a particular variable.

A variable may be classified into one of two types depending on
the values it may assume. A discrete variable may take one of a finite
or countably infinite set of values. The number of students in this
class with yearly incomes in excess of $15,000 is a variable which may
assume any member of the finite set {0, 1, 2, ..., n}, where n is equal to
the total enrollment of this class. This variable is discrete. All
counts of objects or events are discrete. A continuous variable may
take any value from a set of infinite size. The length of a particular
surface, daily outdoor temperatures, and the percentage of female
students in the U.S. are examples of continuous variables. While a
variable may be bounded in its values, as in the case of
the percentage of female students in the U.S., which cannot
be less than zero percent or greater than one hundred percent, there
are an infinite number of values within these bounds. For practical
purposes, if a variable takes on only integer values, it is discrete;
otherwise, it is continuous. Section 4.1 of Blalock (1972) discusses
discrete and continuous data.

Occasionally you will want to re-express or transform a batch of
numbers in the process of performing an analysis. The most common
transformations involve raising numbers to various powers, a process
called exponentiation, or taking logarithms of numbers. Below are
some general rules for exponentiation:


1. $(y^n)(y^m) = y^{n+m}$

2. $(y^n)^m = y^{nm}$

3.

4.

5. $y^0 = 1$

6. $y^1 = y$

7. $y^2 = yy$

A logarithm is an important but easily misunderstood concept. It
is closely related to exponentiation. Any number may be represented
in scientific notation as $p \times 10^k$, where p is a number between 1 and 10, and
k is the integer power of 10. It is also possible to represent any
positive number, N, as $10^y$, where y is any real number. When a number is
represented in this fashion, y is called the logarithm to the base 10 of
the number N. Any positive number other than 1 may be used as the base of
a logarithm.
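
The idea that y is the power to which 10 must be raised to recover N can be checked numerically. The short Python sketch below is only an illustration; common_log is a hypothetical wrapper around the standard library function math.log10, and the three values of N are chosen arbitrarily.

```python
import math

def common_log(n):
    """Return y such that 10 ** y equals n (n must be positive)."""
    return math.log10(n)

for n in (5, 25, 100):
    y = common_log(n)
    # Raising 10 to the power y recovers the original number, up to rounding.
    print(f"N = {n}: y = {y:.5f}, 10 ** y = {10 ** y:.3f}")
```

Each line reports a number, its base 10 logarithm, and the result of raising 10 back to that power.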

More formally, the logarithm to a base b of a number N is
defined as that power to which b must be raised to obtain N. In
mathematics, given any N > 0 and b > 0 (with b not equal to 1), if
$b^y = N$, then $\log_b N = y$. 'Log' is an
abbreviation of logarithm (which, in Greek, means "reckoning number").
For example, $5 = 10^{0.69897}$; hence, $\log_{10} 5 = 0.69897$. Also,
$25 = 10^{1.39794}$ and $100 = 10^2$; hence $\log_{10} 25 = 1.39794$ and
$\log_{10} 100 = 2$. Some general rules for
logarithms are given below:

1) Logs "come" in various bases; however, all logs to different bases
differ only by a multiplicative constant. Specifically, if a and
b are any 2 bases, then $\log_a N = \log_b N \cdot \log_a b$ ($\log_a b$ is the
multiplicative constant for this particular conversion). Because of this
mathematical fact, any base is essentially as good as any other.
However, for various reasons of convenience some bases are
preferred in certain contexts.

2) In QMPM logarithms to the base 10 are used exclusively. These are
written $\log_{10}$, or merely log. This choice is prompted by the
decimal number system. Base 10 logs are called common logs.

3) The second most useful base is the irrational number approximated
by 2.71828..., which is simply denoted by the letter 'e' in honor
of the mathematician Euler. The number e plays an important role
in calculus, as well as in other areas of mathematics. It occurs
frequently in economics. Logs to the base e are written $\log_e$ or
ln (for Napierian or natural logs).

4)  log(l)=0 

5) log(0) is undefined; as N approaches 0 from above, log(N) decreases
without bound, toward -∞.

6) Log of a product is the sum of the logs:

log (PQR) = log P + log Q + log R

7) Log of a quotient is the difference of the logs:

log (P/Q) = log P - log Q.

8) Log of a number to a power is found by multiplying the log of the
number by the power:

$\log (P^n) = n \log P$

9) Log of a root of a number is found by dividing the log by the root:

$\log (\sqrt[n]{P}) = \frac{1}{n} \log P$.

All  of  these  rules  for  the  use  of  logs  derive  from  the  basic 
definition  of  a  logarithm  and  the  rules  of  exponentiation.     Logs  and 
exponentiation  are  discussed  in  Chapter  3  of  Paul  and  Haeussler  (1973). 
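
The rules for changing base and for products, quotients, powers, and roots can be spot-checked numerically. The following Python sketch is a check on one arbitrary set of values, not a proof; log_base is a hypothetical helper, and the comparisons use math.isclose because computer arithmetic carries small rounding errors.

```python
import math

def log_base(n, base):
    """Logarithm of n to an arbitrary base, computed from natural logs."""
    return math.log(n) / math.log(base)

P, Q, n = 8.0, 2.0, 3

# Change of base: log_a N = log_b N * log_a b, here with a = 10, b = 2, N = 8.
change_of_base_ok = math.isclose(math.log10(P), log_base(P, 2) * math.log10(2))

# Log of a product is the sum of the logs.
product_ok = math.isclose(math.log10(P * Q), math.log10(P) + math.log10(Q))

# Log of a quotient is the difference of the logs.
quotient_ok = math.isclose(math.log10(P / Q), math.log10(P) - math.log10(Q))

# Log of a number to a power is the power times the log.
power_ok = math.isclose(math.log10(P ** n), n * math.log10(P))

# Log of an nth root is the log divided by n.
root_ok = math.isclose(math.log10(P ** (1 / n)), math.log10(P) / n)

print(change_of_base_ok, product_ok, quotient_ok, power_ok, root_ok)
```

Changing P, Q, and n to other positive values is a quick way to build confidence in the rules before using them in a transformation.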

Section  4.  Percentages 

Familiarity with percentages is essential in a policy oriented
quantitative methods course. Many data sets contain variables that
are originally recorded as percentages, and often analyses are requested
in terms of percentage change. A percentage is a portion of a number
expressed in hundredths. The following mathematical statement is
common: A is B percent of the number C. Since percents are expressed
in hundredths, B percent is equivalent to B/100, and the above statement
may be written A = (B/100)(C).

There are three common situations encountered when using percentages:

1) A is unknown, B and C are known.

2) B is unknown, A and C are known.

3) C is unknown, A and B are known.

Each  of  these  situations  is  discussed  in  turn  below. 

The first problem is generally stated "B percent of C equals what
number?" The answer is found by multiplying (B/100) x C. For example,
35 percent of 120 equals (35/100) x 120 = .35 x 120 = 42.

The second problem is stated "A is what percent of C?" The answer
is found by dividing A by C and multiplying the result by 100%; i.e.,
if B is the correct answer, B = (A/C) x 100%. For example, to determine
what percent 36 is of 144, calculate B = (36/144) x 100% = (.25) x 100% = 25%.

The last problem occurs when the whole or base number C is unknown.
It is usually stated "A is B percent of what number?" The answer is
found by dividing A by (B/100), C = A/(B/100). If A is 90 and B is 40%,
then C = 90/(40/100) = 90/.40 = 225, that is, 90 is 40% of 225.
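
The three situations amount to solving A = (B/100)(C) for each letter in turn, which can be sketched as three small Python functions. The function names are hypothetical and chosen only for this illustration; the calls repeat the worked examples from the text.

```python
def percent_of(b, c):
    """Situation 1: B percent of C equals what number? Returns A."""
    return (b / 100) * c

def what_percent(a, c):
    """Situation 2: A is what percent of C? Returns B."""
    return (a / c) * 100

def base_number(a, b):
    """Situation 3: A is B percent of what number? Returns C."""
    return a / (b / 100)

print(percent_of(35, 120))    # 35 percent of 120
print(what_percent(36, 144))  # 36 as a percent of 144
print(base_number(90, 40))    # 90 is 40 percent of this number
```

Because the functions use ordinary floating point arithmetic, a result may differ from the hand calculation by a tiny rounding error.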

Chapter 3 of Blalock (1972) is a good reference for percentages,
as are Chapters 1-3 of Zeisel (1968).

Section  5.     Plots  and  Graph  Paper 

In Unit 2 plots, or graphs, of pairs of observations are made. So
that graphs can be read easily some conventions have been established.
The horizontal axis is called the x-axis and the vertical axis is called
the y-axis. The x-axis is placed at the bottom of the page, and the
y-axis at the left side of the page. A point on the graph is represented
as (x,y). Figure 1 illustrates these preliminary steps for a plot
on a page of ordinary graph paper.

The graph paper in Figure 1 has linear scales in both the x and y
directions. This is the type of graph paper that is used most in this
course. Some rules for improving the appearance of a plot are given
below, in brief:

1) Make plots "tall and thin" (y-axis longer than x-axis) or "short and
wide" (y-axis shorter than x-axis), whichever more effectively
conveys your message.

tall  and  thin 


short  and  wide 


2) Use graph paper with light rulings for the units, heavy rulings
every ten units, and intermediate rulings every five units in
between. The paper in Figure 1 has these intermediate lines,
which make a large difference in speed and accuracy (when plotting).

3) Be clever in assigning numerical values to the basic unit; units
other than 1, 2, or 5 times a power of 10 are too awkward and
tedious.

4) In the finished version of the plot, do not clutter the plot by
having too many values marked on the axes.

There are also types of graph paper with non-linear scales. Such
paper can save a lot of time when plotting logarithms of the observations.
One example is semi-log paper, as seen in Figure 2, with a
logarithmic scale for the y-axis. Note that on the log scale the
physical distance from 10 to 100 equals the distance from 100 to 1000.
This distortion or shrinkage is because log(10) is one unit away from
log(100), i.e. log(10) + 1 = log(100), and log(100) is one unit away from
log(1000). Another example of graph paper with non-linear scales is
log-log  paper  which  is  illustrated  in  Figure  3.    Here,  both  axes  have 
logarithmic  scales.     Both  semi-log  and  log-log  graph  papers  will  be  of 
use  when  transforming  batches  by  taking  logarithms  of  the  observations 
because  you  can  go  directly  from  an  observed  value  to  its  logarithm 
by  simply  finding  the  value  on  the  graph  paper's  logarithmic  scale.  This 
operation  makes  it  unnecessary  to  first  calculate  the  logarithm  using 
tables,  a  calculator,  or  a  computer. 
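
The equal spacing of 10, 100, and 1000 on a logarithmic scale can be confirmed with a short calculation. In the Python sketch below, log_scale_position is a hypothetical name for the position of a value along a base 10 logarithmic axis, measured in log units.

```python
import math

def log_scale_position(value):
    """Position of value on a base 10 logarithmic axis, in log units."""
    return math.log10(value)

# Physical distances on the logarithmic scale.
step_10_to_100 = log_scale_position(100) - log_scale_position(10)
step_100_to_1000 = log_scale_position(1000) - log_scale_position(100)

# The same two steps on a linear scale differ by a factor of 10.
linear_10_to_100 = 100 - 10
linear_100_to_1000 = 1000 - 100

print(step_10_to_100, step_100_to_1000)
print(linear_10_to_100, linear_100_to_1000)
```

Each multiplication by 10 moves a point the same distance along the log scale, which is why plotting an observation on log paper is equivalent to plotting its logarithm on ordinary paper.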

Plots  are  very  important  in  this  course.    You  should  reacquaint 
yourself  with  the  basics  of  plotting: 

1)  labelling  the  axes 

2) locating points in the x-y plane using the abscissa, x-coordinate,
and the ordinate, y-coordinate, of each point.

Paul and Haeussler (1973) discuss graphing in Chapter 3, Section 3,
and, in general, can serve as a useful reference volume.

Co., second edition, 1976.

Rosenbach, Joseph B. et al., College Algebra with Trigonometry,
Blaisdell Publishing Co., 1963.

Tukey, John W., Exploratory Data Analysis, Addison-Wesley Publishing
Co., 1977.

Zeisel, Hans, Say It with Figures, fifth edition revised, Harper and
Row Publishers, 1968.

All data sets were taken from Statistical Abstract of the United
States, 1975, U.S. Department of Commerce, Bureau of the Census,
Washington, D.C., 1975.

Homework

Prerequisite  Inventory,  Units  1  and  2 

1. $\log_{10} 7.8 + \log_{10} 2.0 - \log_{10} .5$ =

2. $\log_{10} (9.1 + .9)$ =

3. $\log_7 1$ =

4. $(5.6 \cdot \log_{10} 100) \cdot (.2 \cdot \log_{10} 1000)$ =

Write  the  solutions  to  problems  5-8  in  scientific  notation. 

5.  3.524  •  10^  +  .6476  *  10^  = 

6.  (1.2  •  10-^)  i  (.6  •  10-6)  ^ 

7.  47569.532 

a.  to  five  significant  digits  = 

b.  to  three  significant  digits  = 

8.  (34  •  2  logj  27  +  52.9  •  logs  1)  '  ^°H2  = 

9.  logg  56  . 
logg  56 

10.  log^  5  •  logs  ^5  " 


11.  5/6  of  23  2/3  is 

12.  12  is  what  percent  of  300? 

13.  Change the fraction 7/8 to a percent.

14.  Change .1% to a decimal.

15.  38 is 20% of what number?

16.  If  a  man  has  $1500  in  the  bank  and  the  annual  interest  rate  is  5%, 
how  much  will  he  have  in  the  bank  after  one  year? 

17.  Is  the  square  root  of  25  a  rational  or  an  irrational  number? 

18.  Is $\sqrt[3]{125}$ an integer?

o_    _     

19.  Is  (-3)    a  positive  real  number? 

20.  Can an irrational number ever be an integer?

21.  Which of these is 7 • 7 • 7 • 7?
$4^7$    $7^4$    $4^3$    $14^2$

22.  $(5^0)(8^1)$ =

23.  (il^)  (11^)  * 

24.  15^  i  15^  m 

25.  $3^3$ =

26.  (-3) - (-7) =

27.  (-3) - (+3) + (+3) =

28.     (-3)    equals which one of the following?

3^  -21  -3^  -3  3* 


29.  Round  the  following  values  to  integers, 
a.  1093.91 

c.  '^2  ^ 

d.  33.33 
e.  0.2

f.  -0.956
g.  -.001

h.  -1.77 

30.  The  computer  has  generated  calculations  on  your  data  that  are 
significant  to  only  3  digits.    Cut  the  following  values  to  3 
significant  digits. 

a.  1.0992  •  10^ 

b.  7.7109  •  10'^ 

c.  8.0084  •  102 


31.     If you have negative values in a data batch can you make a
logarithmic transformation on the raw data?


32.     If  you  have  fractional  values  in  a  data  batch  can  you  make  a  square 
root  transformation  on  the  raw  data? 


33.     If a > b and b > c, then which of the following statements is true?

a > c
a - b > c
ab > bc
abc  >  0 


34.    Arrange  the  following  fractions  in  increasing  order:     -2/5,  -1/2,  1/5 

Questions 35-39 pertain to the following data vector:

$X_1$ = 1.64
1.72 
1.68 
1.77 
1.56 
1.95 
1.78 
1.91 
1.97 


1.77 
1.75 
1.93 
1.78 
1.71 
1.63 
1.76 
1.55 
1.66 
1.49 
1.64 
1.70 
$X_{24}$ = 1.68


This data vector, X, contains the number of inspections in units of a
hundred companies conducted by 24 regional federal Occupational Safety
and Health Administration offices. Let i denote the office and $X_1$,
$X_2$, ..., $X_{24}$ denote the number of inspections in hundreds conducted by
each office.

35.    The actual number of inspections conducted by office $X_6$ is

36.  Office $X_4$ is in Boston; office $X_{23}$ is in Seattle. How many more
inspections did Boston conduct than Seattle?

37.  Offices $X_1$, $X_2$, ..., $X_6$ are in the northeast.
Offices $X_7$, $X_8$, ..., $X_{12}$ are in the southeast.
Offices $X_{13}$, $X_{14}$, ..., $X_{18}$ are in the southwest.
Offices $X_{19}$, $X_{20}$, ..., $X_{24}$ are in the northwest.

What notation would you use to indicate the sum of all the
inspections in the southwest?

38.    What is the total number of inspections that took place in the
southeast?


39.    Are  these  data  values  discrete  or  continuous? 


40.  $\sum_{i=10}^{19} 3$ =

41.  $\sum_{i=5}^{24} 4 X_i$ =


42.  X = 1, 3, 5, ..., 11

$\sum_{i=2}^{4} X_i^2$ =

For problems 43-46 state whether the variable described is discrete or
continuous.


43.    The  proportion  of  blacks  in  each  census  tract  in  Pittsburgh. 


44.    The  number  of  persons  living  in  Pittsburgh  that  are  black. 


45.    The  number  of  traffic  fatalities  in  the  U.S.  in  1975. 


46.    The percentage of vehicle defect caused traffic fatalities in the
U.S.


Questions 47-50 refer to the following graph.

[Graph: points A, B, C, D, and E plotted on linear axes; the vertical
axis is marked from 100 to 120 in steps of 5 and the horizontal axis
in steps of 10.]


47.    List the coordinates for points A-E.


48.    Order  points  A-E  by  increasing  values  of  the  abscissa. 


49.    What  is  the  horizontal  distance  between  points  A  and  E? 


50.    What  is  the  vertical  distance  between  points  C  and  E? 


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Homework  Solutions 
Prerequisite  Inventory  Units  1  and  2 


1 

loo       "ll  2 
10 

XX 

 2.  1
 3.  0
 4.  6.72
 5.  1.000 • 10^
 6.  2.0 • 10^
 7.  a.  4.7569 • 10^
     b.  4.75 • 10^
 8.  4.08 • 10^
 9.  log [base illegible] 6
10.  log [base illegible] 49 or [illegible]
11.  19 13/18 or 355/18
12.  4%
13.  87.5%
14.  .001
15.  [illegible]
16.  $1575
17.  rational
18.  yes
19.  no
20.  no
21.  -7'
22.  8

24.  15^
25.  27
26.  4
27.  -3
28.  .3'
29.  a.  1094
     b.  1
     c.  1
     d.  33
     e.  0
     f.  -i
     g.  0
     h.  -2
30.  a.  1.09 • 10^
     b.  7.71 • 10"^
     c.  8.00 • 10^
31.  no
32.  yes
33.  a > c
34.  -1/2, -2/5, 1/5
35.  195
36.  7

37.  $\sum_{i=13}^{18} X_i$

38.  1116 

39.  discrete 

40.  30 

41.  $4 \sum_{i=5}^{24} X_i$

42.  83 

43.  Continuous 

44.  Discrete 

45.  Discrete 

46.  Continuous 

47.  A  (25,105) 
B  (40,110)
C  (20,120) 
D  (30,115) 
E  (50,120) 

48.  C  A  D  B  E 

49.  25  units 

50.  0 
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Units  1  and  2 
Reading  Assignments 

Readings  should  be  completed  before  the  indicated  lecture. 


Lecture 


1-0


Workshop 


1-1 


1-2 


Workshop 


1-3 


1-4 


2-1 


2-2 


Reading

"Introduction to QMPM"
Tufte,  Chapter  1 
Tukey  &  Wilk 

Prerequisite  Inventory 
Tukey,  pp.  1-7 

Tukey,  pp.  7-26 
McNeil ,  pp.  1-6 
Wallis & Roberts, pp. 177-182

Tukey,  Chapter  2 
McNeil,  pp.  6-9,  16-17 

"Some  Principles  for  Graphics  of  Tables  and  Char 
Huff, pp. 60-142

Tukey,  Chapter  3 
McNeil,  pp.  9-16 

Huff,  pp.  7-36 

Wallis & Roberts, pp. 182-193
Tanur, pp. 229-236

Tukey,  pp.  97-101 
McNeil,  pp.  27-31 

Profiles in School Support, pp. 15-69

Tukey, pp. 102-115

McNeil, pp. 32-39

Fairley  &  Mosteller,  pp.  87-109 


Texts: 

Fairley, William B. and Frederick Mosteller, Statistics and Public
Policy, Reading, Mass.: Addison-Wesley Publishing Co., 1977.

Huff, Darrell, How to Lie with Statistics, New York: W.W. Norton
& Co., 1954.

McNeil, Donald R., Interactive Data Analysis, New York: John Wiley
& Sons, 1977.

"Profiles in School Support," prepared for the National Center for
Education Statistics by Eugene P. McLoone, Washington, D.C.: U.S.
Government Printing Office, 1974.

Tanur, Judith, et al., Statistics: A Guide to the Unknown, San
Francisco: Holden-Day, Inc., 1972.

Tufte, Edward R., Data Analysis for Politics and Policy, Englewood
Cliffs, N.J.: Prentice-Hall, 1974.

Tukey, John W., Exploratory Data Analysis, Reading, Mass.: Addison-
Wesley, 1977.

Tukey, John W. and M. B. Wilk, Data Analysis and Statistics: An
Expository Overview, AFIPS Conference Proceedings, volume 29,
1966, pp. 695-709.

Wallis, W. A. & H. V. Roberts, Statistics: A New Approach, Glencoe,
Ill.: Free Press, 1956.


Lecture 1-0.     Introduction to QMPM and Unit 1

General Introduction to Quantitative Methods for Public Management and
Specific Introduction to Unit 1, Analysis of Single Batches of Data

Lecture Content:

1.  Discuss purpose and organization of course, and the nature
of data analysis

2.  Introduction to the objectives, problems, and notation of
Unit 1


Main Topics:

1.  Introduction to QMPM--Detailed structure

2.  Introduction to Unit 1

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Topic  1.     Introduction  to  QMPM — Detailed  structure 

I.     Nature of Data Analysis:    Numerical detective work
What does a data analyst do?

1.  Take apart data to find structure:     Data = Signal
(Fit) + Noise (Residual)

2.  Familiarity with various forms of data sets and ways
of "handling" them

3.  Analytical  methods  to  take  data  apart 

4.  Exploration  preceding  confirmation — Detective  work 
(Investigative  vs.  Judicial  evaluative  process) 

5.  Iterative  and  Interactive  process--uses  data  as  guide  to 
procedure 
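
The decomposition named in item 1 above, Data = Signal (Fit) + Noise (Residual), can be sketched in a few lines of Python. This is only an illustration: the outline does not prescribe a particular fit at this point, so using the batch median as the fit is an assumption, and the helper name `split_fit_residual` is hypothetical. The five values are taken from the 1970 tract population listing used later in this lecture.

```python
from statistics import median


def split_fit_residual(batch):
    """Split a batch into a single-number fit and one residual per value.

    Assumption for illustration: the fit is the batch median.
    """
    fit = median(batch)
    residuals = [x - fit for x in batch]
    return fit, residuals


# Five values taken from the 1970 tract population listing.
batch = [972, 4082, 1972, 391, 3062]
fit, residuals = split_fit_residual(batch)

# Every data value is recovered exactly as fit + residual.
rebuilt = [fit + r for r in residuals]

print("fit:", fit)
print("residuals:", residuals)
```

Large residuals point at the values that the fit describes poorly, which is where the "detective work" usually continues.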


II.     Structure  of  QMPM 

1.      Rationale:    Analysis for decision making
Problems facing data analysts

a.  Public managers need to make effective decisions

b.  Must be able to process data, present results to
nonquantitative audiences

c.  Relevant data usually quantitatively "messy" or
"dirty" (measurement error, NA values)


d.  Analyses are usually unplanned, ad hoc, second hand

e.  Implication--need operational analytic skills that
can handle data problems and change data into
information

2.      Objective:     Provide  these  necessary  skills 
Students  learn  to: 

a.     Gather and prepare data

b.  Analyze  data  to  uncover  structure  and  evaluate  the 
analysis 

c.  Present  data  and  interpret  analytic  results  for 
improved  communication 
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3.    Philosophy:     Preparation  for  practice 
QMPM's  emphasis: 

a.  Quantitative — numeric  data,  relating  to  quantities 
or  measures 

b.  Real data--empirical, based on real-life observation
or  experiment 

c.  Policy  relevant  data—data  used  in  making  decisions 

d.  Graphics — visual  displays  or  pictures 

e.  Resistant  and  robust  techniques—unaffected  by 
deviant  values  or  erroneous  assumptions 

f.  Data  analytic--not  just  statistical;  models  for  data 

g.  Computer  orientation— exploits  special  user-oriented 
computing  system 

III.     [illegible] and organization of course: Elaborate
[illegible] requires cooperation between students and
instructional staff

1.  Instructors--identify, offices, office hours

2.  Module/Unit  design 

3.  Introductory lectures--precede substantive learning
units

4.  Prerequisite inventories--establish base upon which new
skills are built

5.  References and texts--describe

6.  Computer system--location, staff--if used

7.  Video tapes--location, staff--if used

8.  DataBank — documentation,  staff--if  used 

9.  Calculators — promote  purchase 

10.  Homework — frequency,  schedule 

11.  Workshops — 1  per  week,  flexible  role  in  course 

12.  Calendar

Topic 2.     Introduction to Unit 1, Analysis of Single Batches of Data
I.     Introduction  to  the  objectives  of  Unit  1 
1.     Questions  to  be  answered  in  Unit  1 

a.  What  is  a  batch  of  numbers? 

A set of similar values obtained in some consistent          (1)
fashion (from Prerequisite Inventory, Module I)

b.  What  analyses  can  be  done  on  a  single  batch? 
What  can  we  say  about  a  batch? 


2.     Skills to be mastered in Unit 1                              (2)

a.  Perceiving and recognizing a batch

b.  Organizing a batch to facilitate presentation,
comprehension, and analysis

c.  Condensing a batch to facilitate summarization

d.  Transformations to promote symmetry

e.  Definition and recognition of well-behaved batches

II.     Introduction to the problems of Unit 1

1.     What is a batch?    Look at an example.

a.     Similar numbers--counts of persons in census tract
scale

Example:     1970 populations of the 186 census tracts
in Pittsburgh

b.     Consistent feature--data collected in 1970 census (3)
enumeration of city of Pittsburgh

(Note  trailing  zeros  and  tract  sequence  arrangement ; 
discuss notion of "census tract" and source of
census  data) 

Example:     1970  populations,  in  thousands  of  persons  (4) 
(comment  on  units) 

2.     What can we say about a batch? What are its features? (5)
How  can  we  summarize  it? 

a.     Minimum value--how small is the smallest value of
the batch?

b.     Maximum value--how large is the largest value of the
batch?


c.  Typical  value — what  is  the  "average"  value  of  the 
batch? 

d.  Variability — how  spread  out  is  the  batch? 

e.  Uniformity — how  clustered  is  the  batch? 

f.  Shape — how  symmetric  is  the  batch? 

3.     Example:     Population data again (6)

a.  Minimum value--334 persons, tract 185

b.  Maximum value--7916 persons, tract 95

c.  Cannot answer remaining questions

4.     Example: Number of blacks in each of the census tracts (7)
of Pittsburgh in 1970

a.  Minimum value--0 Blacks, tracts 165, 129, 146, 171,
176, 177, 185


b.  Maximum value--4611 Blacks, tract 74

c.  Cannot answer remaining questions, but note large
number of small values

5.  Conclusion 

a.  Need  methods  to  organize  data 

b.  Need tools to summarize important features

c.  Methods  should  be  easily  performed 

d.  Summaries  should  be  readily  comprehended 


III.     Introduction  to  Notation  of  Unit  1 

(See Prerequisite Inventory, Module I, for reference text)

1.  Conventions                                              (8)

a.     Capital letter ("X") denotes entire single batch of
values

b.     Individual values identified by single subscripts

2.  Example:     Pittsburgh population

a.  Let X = Populations of Pittsburgh census tracts in 1970

b.  Let $X_1$ = Population of tract 1 = 972 persons
Let $X_2$ = Population of tract 2 = 4082 persons

. . .

Let $X_{185}$ = Population of tract 185 = 334 persons
(Note arbitrariness of assignment of tract numbers)

c.  In general, there are n tracts (in this case, n = 185)

Thus $X_n = X_{185}$ = 334 persons
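
The subscript convention above maps directly onto a list in a programming language, with one caveat: Python counts from 0 while the outline counts from 1. The sketch below is an illustration only. The helper name `value` is hypothetical, and the batch holds just the first four listed populations rather than all n tracts.

```python
# X is the whole batch. Only the first four listed values are used here,
# for illustration; the full batch has one value per census tract.
X = [972, 4082, 1972, 391]


def value(batch, i):
    """Return X_i using the outline's 1-based subscript (hypothetical helper)."""
    if not 1 <= i <= len(batch):
        raise IndexError(f"subscript {i} is outside 1..{len(batch)}")
    return batch[i - 1]  # Python lists are 0-based


n = len(X)            # number of values in the batch
print(value(X, 1))    # X_1
print(value(X, 2))    # X_2
print(value(X, n))    # X_n, the last value in the batch
```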
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Lecture  1-0 
Transparency Presentation Guide

Lecture Outline       Transparency
Location              Number*         Transparency Description

Topic 2
Section I
  1.a                 1               Definition of Batch of Data
  2                   2               Topics for Unit 1

Section II
  1.b                 3               1970 Populations of Pittsburgh Census Tracts
  1.b                 4               1970 Populations, in Thousands of Persons
  2                   5               Questions to be Answered for Single Batches
  3                   6               1970 Populations, Maximum & Minimum Indicated
  4                   7               1970 Black Populations of Pittsburgh Census
                                      Tracts, Maximum & Minimum Indicated

Section III
  1-2                 8               Conventions & Example of Notation

*Refers to numbers in parentheses on righthand side of lecture
outline.

**Bracketed transparencies are [illegible] the same page.

[1970 Populations of Pittsburgh Census Tracts]

972, 4082, 1972, 391, 3062, 2919, 2424, 6667, 2065, 2973,
3712, 2505, 4187, 1050, 1645, 3629, 1867, 1359, 2655, 1876,
2471, 728, 1205, 2382, 765, 2776, 2135, 1349, 5747, 2135,
1330, 4730, 2155, 4247, 1472, 3832, 1253, 2316, 3692, 4060,
2744, 3228, 5269, 2979, [illegible], 3268, 5319, 5435, 204?, 1521,
4615, 3994, 7910, 1965, 5203, 484, 5680, 3156, 3578, 2398,
1424, 4719, 3509, 6796, 5371, 996, 2607, 2396, 1670, 2297,
335, 6235, 3121, 2670, 1227, 1159, 2579, 3338, 2779, 345,
4327, 1792, 2932, 2125, 1056, 3103, 2487, 1193, 3291, 1289,
3853, 3905, 2857, 3812, 2612, 1640, 3921, 6242, 5816, ?527,
955, 3297, 3413, 334, 631, 735, [illegible], 729, 3689, 2437,
1919, 3294, 44?2, 453, 1645, 2442, 2915, 1405, 2386, 3122,
1619, 1410, 3628, 4415, 3153, 2942, 3469, 4095, 1452, 3378,
1971, 3945, 2602, 846, 4014, 1645, 5300, 3133, 4520, 6003,
2068, 1577, 2064, 3188, 4392, 4758, 3962, 3752, 1864, 3820,
2619, 1418, 5630, 3765, 7425, 2574, 663, 2658, 791, 1556,
1343, 569, 568, 442, 2967, 2254, 2569, 2??5, 1963, 719,
1644, 4561, 3609, 4437, 1399, 2144, 992, 1906, 3425, 5360,
1269, 2629

[1970 Populations, in Thousands of Persons]

0.972, 3.062, 2.085, 4.187, 1.867, 2.471, 0.765, 5.747, 2.155, 1.253,
2.744, 5.269, 5.319, 1.521, 1.985, 1.156, 4.719, 0.996, 2.297, 2.67,
1.338, 1.792, 3.103, 1.269, 1.612, 6.242, 1.297

4.082, 2.919, 2.973, 1.05, 1.359, 0.728

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Lecture  1-1.    Organization  for  Analysis 

Organization  for  Analysis;  The  Use  of  Numeric  and  Graphic  Methods  for 
Analytical  Organization  of  Single  Batches 

Lecture Content:

1.  Methods for recording and presenting a batch of
data in an organized manner.

2.  Show how such tools convey various batch characteristics.

Main Topics:

1.  Methods for organizing a batch

2.  Questions to ask of a batch

Tools Introduced:

1.  Sorted  batch 

2.  Histogram 

3.  Stem-and-Leaf  Display 
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Topic  1.    Methods  for  Organizing  a  Batch 

I.    Basic  Issue:    Organization  of  data 

1.  Arbitrary--the manner in which data are usually gathered,
recorded or transmitted

a.  At the data collector's discretion--i.e., a matter
of convenience

b.  Contextually defined--e.g., stations on transit line

c.  May  depend  on  data  gathering  procedure — e.g.,  census 

d.  Obscures  behavior  of  batch  values 

e.  Makes  summarization  and  analysis  difficult 

2.  Analytical — the  manner  in  which  we  desire  to  arrange  data 

a.  Consistent 

b.  Context free

c.  Reliable

d.  Conveys behavior of batch values:     shape, spread,
location, outliers

e.  Simplifies  continued  analysis  of  the  batch 

II.    Problem:    Analyst  often  must  use  data  which  come  arbitrarily 
organized 

1.  Arbitrarily  organized  data  are  unwieldy 

2.  Such data do not permit ready description

3.  Such data do not permit conclusions to be drawn about
batch behavior--cannot get a "feel" for the batch

III.     Solution:     Simple and understandable tools for analytical
organization

1.  Simplest method--Sorted batch (2)

2.  Classical method--Histogram (3)

3.  Exploratory method--Stem-and-Leaf display (4)
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IV.  Methods 

Only analytical organization of data is discussed in this
lecture. The techniques covered can be applied to all types
of batches of data. They are visual displays that are easily
grasped and help the data analyst by addressing
the problem of what to examine in a batch.     Universal rules
for reliable and quick construction are presented.

1.  Sorted batch:     a simple organization, an array of values,
ordered from smallest to largest

a.  Example  shows  a  sorted  batch:     1970  populations  of 
Pittsburgh  census  tracts 

b.  Features 

i.     Simple  idea 
ii.     Retains  information  on  individual  values 
iii.     Operationally  difficult  to  construct 

c.  Analytic  qualities 

i.     Largest  and  smallest  values  identifiable 

ii.     Ability  to  locate  order  statistics  (explain 
"counting  in") 

d.  Procedure:     arrange  data  in  increasing  order 

e.  Sorted  batch  constructed  by  computer: 

In the session introducing the CMU-DAP system.
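
As a rough illustration of the sorted batch and of "counting in", the following Python sketch sorts a small batch and reads off values at a given depth from either end. It is not the CMU-DAP procedure, which this outline does not describe; the function names are hypothetical and the eight values are a subset of the tract population listing.

```python
def sorted_batch(batch):
    """Arrange the batch in increasing order."""
    return sorted(batch)


def count_in(ordered, depth, from_top=False):
    """Return the value found by counting in `depth` places from one end.

    Depth 1 is the extreme value itself (smallest, or largest if from_top).
    """
    if not 1 <= depth <= len(ordered):
        raise ValueError("depth must lie between 1 and the batch size")
    return ordered[-depth] if from_top else ordered[depth - 1]


# A subset of the 1970 tract populations, in the order listed.
batch = [972, 4082, 1972, 391, 3062, 2919, 2424, 6667]
ordered = sorted_batch(batch)

print(ordered)
print("smallest:", count_in(ordered, 1))
print("largest:", count_in(ordered, 1, from_top=True))
print("third from the bottom:", count_in(ordered, 3))
print("third from the top:", count_in(ordered, 3, from_top=True))
```

Sorting by hand is what makes this tool "operationally difficult"; once the batch is ordered, every order statistic is a simple count from one end.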

2.  Histogram:    A bar graph which visually presents some of
the  information  in  a  batch 

a.  Example:    Histogram  of  1970  populations  of  Pittsburgh 
census  tracts 

b.  Features 

i.     Reasonably  interpretable 
ii.     Common  technique 
iii.     Formal  definition 
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iv.  Loses  information  on  individual  values 
v.    Operationally difficult to construct


c.  Analytic qualities

i.     Shape--separation, symmetry, irregularity, and
clustering of values

ii.     Spread--variation of values

d.  Procedure:

(Draw histogram of 1970 populations of Pittsburgh
census  tracts  on  blackboard,  explaining  each  step.) 

i.    Draw vertical (y) and horizontal (x) axes on a
sheet of ordinary graph paper

ii.  On horizontal axis, mark off smallest data value
and highest data value in batch, using the scale
of the axis; in this case, 0.3 and 7.9 thousand

iii.    Divide this interval into the desired number of
"bins" of equal size for display.     It may be
necessary to round the smallest value down and
the largest value up to obtain a convenient width
for each subinterval.     For these data, use 8 bins
of width 1000.

iv.     Record number of data values falling into each
bin.     This information is needed to determine
height of each bar.

v.    Mark off vertical axis to correspond to number
of  data  values  per  bin 

vi.    Draw  in  bars 

vii.    Can  also  have  intervals  of  unequal  size,  and  can 
combine  intervals  to  produce  a  "squeezed"  version 
or  break  up  intervals  to  produce  a  "stretched" 
version. 

e.    Histogram  constructed  by  computer: 

In session introducing the CMU-DAP system
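
The binning steps of the histogram procedure (round the smallest value down, round the largest value up, count the values per bin) can be sketched as follows. This is a minimal illustration under stated assumptions: values are whole numbers, bins are of equal width, each bin includes its left edge, and the bars are drawn sideways as rows of `#` characters. The function names are hypothetical, and only twelve of the tract populations are used, so the sketch produces 7 bins rather than the 8 needed for the full batch.

```python
def histogram_counts(batch, width):
    """Count values in equal-width bins [low + k*width, low + (k+1)*width)."""
    low = (min(batch) // width) * width        # round smallest value down
    high = (max(batch) // width + 1) * width   # round largest value up
    counts = [0] * ((high - low) // width)
    for x in batch:
        counts[(x - low) // width] += 1
    return low, counts


def draw_bars(low, counts, width):
    """Render one text row per bin; the row length is the bar height."""
    rows = []
    for k, count in enumerate(counts):
        left = low + k * width
        rows.append(f"{left:>5}-{left + width - 1:<5} {'#' * count}")
    return "\n".join(rows)


# Twelve of the 1970 tract populations, in the order listed.
batch = [972, 4082, 1972, 391, 3062, 2919, 2424, 6667,
         2065, 2973, 3712, 2505]
low, counts = histogram_counts(batch, width=1000)

print(counts)
print(draw_bars(low, counts, 1000))
```

Note that the counts keep the shape and spread of the batch but not the individual values, which is the information loss listed under the histogram's features.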


3.  Stem-and-Leaf display:    An easy and versatile method of
organizing a batch into roughly numerical order

a.  Example:     Stem-and-Leaf of 1970 Pittsburgh census
tract populations in thousands of persons

b.  Features

i.     "Face validity"

ii.     Retains  information  on  individual  data  values 
(display  and  storage  versions) 

iii.    Many  versions — flexible 

iv.    No  formal  rules  for  "correct"  version 

v.     Operationally easy to construct

c.  Analytic  qualities 

i.     Largest  and  smallest  data  values 
ii.     Location  of  order  statistics 
iii.  Shape 
iv.  Spread 

d.  Procedure:

(Work  through  an  example  on  the  blackboard.) 

i.    Choose  a  convenient  unit,  or  power  of  ten,  for 
the  display 

ii.     Every  data  value  in  the  batch  is  cut  to  a  whole 
multiple  of  the  unit 

iii.     Separate  each  value  for  the  display  into  a  stem 
and  a  leaf 

iv.     Find  the  largest  and  smallest  stems 

v.    Write down these stems and all the intervening
stems in a vertical column

vi.    Use asterisks (*) to indicate the number of digits
represented in a leaf

vii.    Draw a vertical line

viii.    Place the leaves on the line corresponding to the
correct stem

e.    Another example:

Net migration for Pennsylvania counties in percent
of population from 1970 to 1974

(Do Stem-and-leaf of batch on blackboard)
(Set aside the high outliers)
(Unit = 0.1%, single stems)

i.    The outlying counties have been set aside--the
causes of their large increase in population
should be investigated.     Counties are: Monroe,
Pike, Wayne, Wyoming

ii.    If the display appears too "squeezed", we can
increase the number of lines per stem from 1 to
2 or 5

f.  Another  example: 

Stem-and-leaf  of  the  Pittsburgh  populations  where 
each  stem  now  has  2  lines. 

Use "*" and "." to split the stem for leaves 0-4
and 5-9.

i.    The first line, labelled *, holds leaves 0-4.
ii.    The second line, labelled ., holds leaves 5-9

g.  Another  example: 


Stem-and-leaf for the Pittsburgh populations where
each stem now has 5 lines.

i.    Line labelled * holds leaves 0-1

ii.    Line  labelled  t  holds  leaves  2-3  ("two"  and 
"three") 

iii.    Line  labelled  f  holds  leaves  4-5  ("four"  and 
"five") 

iv.     Line labelled s holds leaves 6-7 ("six" and
"seven")

v.    Line labelled . holds leaves 8-9

vi.     If a display appears too "stretched", change the
unit by decreasing it, and decrease the number
of lines per stem

(Compare  net  migration  stem-and-leaf  displays  on 
two  different  scales.) 

vii.    Choose the "best" display by controlling the
maximum number of leaves per line

Rough rule:    max leaves/line $\approx 10 \log_{10} N$

h.     Stem-and-Leaf  display  constructed  by  computer: 

In  the  session  introducing  the  CMU-DAP  computing 
system. 
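
The construction steps for the stem-and-leaf display can be sketched in Python as below. This is an illustration, not the CMU-DAP routine. It assumes non-negative values, a unit that is a power of ten, values that are cut (not rounded) to the unit, and either one or two lines per stem; the five-line version is omitted for brevity. The function names and the row labels (`2*` for leaves 0-4, `2.` for leaves 5-9) are choices made for this sketch.

```python
import math
from collections import defaultdict


def stem_and_leaf(batch, unit, lines_per_stem=1):
    """Return the rows of a stem-and-leaf display as strings."""
    if lines_per_stem not in (1, 2):
        raise ValueError("this sketch supports 1 or 2 lines per stem")
    rows = defaultdict(list)
    for x in sorted(batch):
        cut = int(x // unit)              # cut to a whole multiple of the unit
        stem, leaf = divmod(cut, 10)      # last digit is the leaf
        half = leaf // 5 if lines_per_stem == 2 else 0
        rows[(stem, half)].append(leaf)
    stems = [stem for stem, _ in rows]
    out = []
    for stem in range(min(stems), max(stems) + 1):   # keep intervening stems
        for half in range(lines_per_stem):
            label = f"{stem}{'*.'[half]}" if lines_per_stem == 2 else f"{stem}"
            leaves = "".join(str(d) for d in rows.get((stem, half), []))
            out.append(f"{label:>3} | {leaves}")
    return out


def max_leaves_per_line(n):
    """Rough rule from the outline: about 10 * log10(N) leaves per line."""
    return 10 * math.log10(n)


# Twelve of the 1970 tract populations; unit = 100 persons.
batch = [972, 4082, 1972, 391, 3062, 2919, 2424, 6667,
         2065, 2973, 3712, 2505]
one_line = stem_and_leaf(batch, unit=100)
two_lines = stem_and_leaf(batch, unit=100, lines_per_stem=2)

print("\n".join(one_line))
print()
print("\n".join(two_lines))
print(max_leaves_per_line(100))
```

With unit 100, the value 2919 is cut to 29 and appears as leaf 9 on stem 2. Changing `unit` or `lines_per_stem` is how a display that looks too "squeezed" or too "stretched" would be adjusted in this sketch.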
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Topic  2.    Questions  to  Ask  of  a  Batch 

I.    Basic Issue:    Once organized, what can we learn from a single
batch?

II.    Try to answer the following questions which relate to the (10)
batch values:

1.    Do the values cluster or are they uniformly spread?

2.    Are there any deviant values, outliers?

3.    Is the batch symmetrical or asymmetrical?

4.    Are the values widely spread out?

5.    Are there any separations in the display?

6.    What are the order statistics of the batch?

7.    Where is the "center" of the batch?


III.    Methods  related  to  questions: 

Stem-and-leaf displays permit us to answer all of these
questions.    Histograms do not answer (2), (6) or (7) completely,
since individual data values cannot be identified.
Sorted batch does not answer (3), (4) or (7), and (2) and
(5) are difficult to answer.

(Discuss appearance of each of the next 2 slides; answer (11-12)
questions)


(Suggest the use of answers to questions in II as attempts
to summarize batch.     Summary is facilitated by analytically
organizing the batch.)

Lecture 1-2.     Condensation for Description

Condensation  for  Description:  The  Use  of  Numeric  and  Graphic  Methods  (1) 
to  Describe  the  Information  Contained  in  Single  Batches 


Lecture Content:

1.  Discuss methods for both condensing a batch and presenting
the condensed "summaries"

2.  Show how such tools effectively describe the batch


Main Topics:

1.  Condensing a batch to a small set of numbers

2.  Adequacy of these summaries to describe a batch

Tools Introduced:

1.  5-number summary

2.  Simple schematic plot

3.  Expanded number summary and schematic plot
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Topic  1.     Condensing  a  Batch  to  a  Small  Set  of  Numbers  and  a  Graphic 
I.     Basic  issue:     Condensation  of  a  batch 

1.     Condensation is "second-order" summarization--less information
is retained than with organization techniques

a.  Stem-and-Leaf display and histogram give too much
detail

b.  Seek several easily obtained numbers which convey
some of the detail of the organization tools

c.  More expedient to "describe" the batch with these
"number summaries" than with the entire stem-and-leaf

d.  Schematic plots are efficient mnemonic devices

2.     Condensation causes a loss in information except in
special instances when the batch can be reconstructed
with knowledge of only a few values, i.e., when the batch
is well-behaved


II.     Problem:     Organization  tools  are  not  convenient  summaries 
of  a  batch 

1.  Usually,  organized  batch  retains  too  much  information. 
Less  may  be  more  useful 

2.  Condense  batch  to  quantify  answers  to  these  questions: 

a.  What is a typical value of the batch?

b.  How much variation is present in the values of
the batch?

3.  Condensations must be easy to obtain, effective in their
summarization,  readily  interpreted,  communicated,  and 
remembered 


III.     Solutions:     Simple  and  expanded  numeric  and  graphic 
summaries  of  a  batch 

1.  Simple numerical method--5-number summary

2.  Simple graphical method--Simple Schematic ("box-and-whisker")
plot

3.  Expanded methods--Expanded number summary and schematic
plot

IV.     Methods:    Numeric and Graphic presentations of order statistics

1.  Notion  of  order,  depths,  folding,  and  "counting  in" 

(Note:    Distinguish  between  order  statistics  and  actual 
data  values) 

2.  5-number  summary :     Simple  condensation  (2) 

a.  Example:     shows  a  5-number  summary  of  1970  populations 
of  Pittsburgh  census  tracts 

b.  Features 

i.     Displays of some order statistics--median, max,
min, hinges (quartiles)

ii.    Adequately conveys characteristics of most
batches

iii.     Computable, with some difficulty, from sorted
batch

iv.     Easily computed from stem-and-leaf

v.     Cannot be computed from histogram

vi.     Does not give sufficient detail for a large or
asymmetric batch

c.  Analytic qualities

i.     Largest and smallest data values ("extremes")
contained in summary

ii.    Median, or middle value of batch, included as
a typical value

iii.    Hinges (quarters), or medians of the two halves
of batch, included

d.  Procedure

i.    Add a column of cumulative counts, or "depths",
to a stem-and-leaf display of the batch

ii.    Depths should be cumulated from first line down (3)
and from last line up, toward the middle


iii.     Stop  cumulating  when  cumulative  counts  in  each 
direction  are  roughly  equal 

iv.     First number in summary is smallest value in (4)
batch, minimum, which has a depth of 1. Label
the minimum "E"

v.    Last number in summary is largest value in batch,
maximum, which has a depth of 1.     Also label the
maximum "E"

vi.     Third number in summary is middle value in batch,
median, which has a depth of (N+1)/2. Median
is defined as single middle value of batch (N odd)
or mean of two middle values (N even), and is
labelled "M"

vii.  Second and fourth numbers in summary are hinges,
medians of the two halves of batch. Hinges have
a depth of (Depth of M + 1)/2, and are labelled
"H"

viii.     Arrange the 5-number summary vertically with 3 (5)
columns:

Column 1 = Depths

Column 2 = Letter Abbreviations (E, H, M)
Column 3 = Values


ix.     Tukey  calls  5-number  summary  a  "letter  value 
display" 

e.     Another "view":     Information contained in 5-number (6)
summary

i.    Useful measure of "spread" of batch is midspread. (7)
Computable from number summary

Midspread = Upper Hinge (UH) - Lower Hinge (LH)
Variability of batch varies directly with
midspread

ii.     25% of batch is less than LH, 25% greater than (6)
UH. Hence 50% of batch lies between hinges

iii.    Median (M) is a typical or "central" value of
the batch. Half of the batch is less than M, and
half is greater. We will use the median as the
"average" value of the batch

iv.     Range of this batch is also a measure of spread
Range = Upper Extreme - Lower Extreme

f.    Another  example:     Symmetric  batch 


i.     Symmetric batch has median lying halfway between (8)
the hinges, halfway between extremes, and halfway
between any other "folds"

ii.     Any batch where median is not exactly halfway
between the hinges or extremes is not symmetric--it
is asymmetric

g.     Another example:     Median incomes for families and (9)
unrelated individuals in Pittsburgh census tracts,
1970

(Compute 5-number summary from sorted batch) (10)
(Compute 5-number summary from stem-and-leaf of
batch)

(Try to compute 5-number summary from histogram
of batch)

h.     5-number summary constructed on computer:

In the session introducing the CMU-DAP computing
system 
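
The depth rules for the 5-number summary can be sketched in Python as follows. This is an illustration rather than the CMU-DAP routine, and the function names are hypothetical. One assumption goes beyond the outline: when the median depth ends in one half, the half is dropped before the hinge depth is computed, which is Tukey's usual convention; the outline states the rule simply as (Depth of M + 1)/2.

```python
def value_at_depth(ordered, depth, from_top=False):
    """Value at a whole or half-integer depth, counting in from one end."""
    seq = ordered[::-1] if from_top else ordered
    whole = int(depth)
    if depth == whole:
        return seq[whole - 1]
    # Depth ends in .5: average the two neighbouring values.
    return (seq[whole - 1] + seq[whole]) / 2


def five_number_summary(batch):
    """Return extremes (E), hinges (H) and median (M) of a batch."""
    ordered = sorted(batch)
    n = len(ordered)
    depth_m = (n + 1) / 2
    depth_h = (int(depth_m) + 1) / 2   # assumption: drop the half first
    return {
        "E_low": ordered[0],
        "H_low": value_at_depth(ordered, depth_h),
        "M": value_at_depth(ordered, depth_m),
        "H_high": value_at_depth(ordered, depth_h, from_top=True),
        "E_high": ordered[-1],
    }


# Twelve of the 1970 tract populations.
batch = [972, 4082, 1972, 391, 3062, 2919, 2424, 6667,
         2065, 2973, 3712, 2505]
summary = five_number_summary(batch)

midspread = summary["H_high"] - summary["H_low"]
batch_range = summary["E_high"] - summary["E_low"]

# In a symmetric batch the median lies halfway between the hinges.
mid_hinge = (summary["H_low"] + summary["H_high"]) / 2

print(summary)
print("midspread:", midspread, "range:", batch_range)
print("median:", summary["M"], "halfway between hinges:", mid_hinge)
```

For this small batch the median (2712) and the point halfway between the hinges (2702.75) are close, but the upper extreme lies much farther from the median than the lower extreme does, so the batch is not symmetric.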

3.     Simple Schematic Plot:     Graphical presentation of 5-number
summary (Tukey calls this tool a "box-and-whisker" plot)

a.  Example:     Schematic plot for median incomes for (11)
Pittsburgh census tracts in 1970

b.  Features


i.     Extremely useful in discussing appearance of
batch

ii.     Some attributes of batches, such as symmetry,
best  conveyed  by  this  graphical  tool 


iii.     May  be  difficult  to  recover  the  exact  values 
of  the  5-number  summary  from  the  plot 
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c.  Analytic  Qualities 

i.    Made  on  ordinary  graph  paper 
ii.     y-axis  represents  values  in  batch 
iii.     Extremes, hinges, and median clearly marked
iv.     Shape  and  spread  of  batch  easily  seen 

d.  Procedure 


i.    Draw  a  box  that  stretches  from  hinge  to  hinge, 
crossing  with  a  bar  at  the  median 

ii.     Draw a line, or "whisker", from the box to each
extreme 

iii.     Examine  length  of  box  for  information  on  spread 
of  batch 

iv.     Examine  location  of  bar  within  box,  and  box 
between  extremes  for  information  on  symmetry 
of  batch 

v.     Examine length of whiskers for information on
outliers


e.    Another example:     Pittsburgh populations, with the (12)
outliers indicated

Median lies halfway between hinges, but large number
of outliers makes the batch asymmetric


f.    Another  example:     Schematic  plot  of  net  migrations  (13) 
of  Pennsylvania  counties,  1970-1974 

Schematic  plots  need  not  be  made  vertically  on  graph 
paper.     Plots  can  be  drawn  horizontally  on  regular 
paper. 

4.     Expanded number summaries and schematic plots--Adequate
condensation for large batches (N > 100)

a.     Example:     Schematic plot of Pittsburgh median incomes (14)
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b.  Features 

i.     Emphasizes the outliers in the batch

ii.    Very  effective  in  condensing  the  batch 

iii.    Expanded  summary  best  presented  as  a  schematic 
plot 

c.  Analytic  Qualities 

i.    Define  fences  beyond  the  hinges  to  identify 
outliers 

ii.    Outliers  suitably  indicated  on  the  schematic 
plot 

iii.     Shape,  spread^  and  outliers  of  batch  easily 
seen 

d.  Procedure 

i.  Introduce further descriptive numbers (15)

Step = 1.5 x Midspread

Inner fence (f) = Hinge +/- 1 step

Outer fence (F) = Hinge +/- 2 steps

Adjacent values are the data values closest to, but
still inside, the inner fences

ii.  Data values between the inner and outer fences
are "outside" and are marked on the plot with (16)
circles

iii.  Data values beyond the outer fences are "far out"
and are marked on the plot with squares

iv.  Whiskers on the plot should be dashed, ending
with dashed crossbars at the adjacent values

v.  Far out values should be labelled on the plot in
capital letters

vi.  Outside values and adjacent values should be
labelled on the plot in small letters

vii.  Tukey recommends the use of a "Fenced-Letter (17)
Display" to reduce clutter

(Note:  These definitions of outside values may
not be sufficient in certain cases.  Deciding
whether a value is deviant is usually a subjective
process.  These techniques help identify
outliers but should not replace common sense.)
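The step and fence definitions above can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the function names, the sample batch, and the choice to treat a value lying exactly on a fence as inside it are assumptions of the sketch. Hinges are taken as the medians of the lower and upper halves of the sorted batch, with the middle value shared when N is odd.

```python
from statistics import median


def hinges(sorted_batch):
    """Return (lower hinge, upper hinge) of an already sorted batch."""
    n = len(sorted_batch)
    half = (n + 1) // 2  # the middle value belongs to both halves when n is odd
    return median(sorted_batch[:half]), median(sorted_batch[n - half:])


def fence_summary(batch):
    """Compute step, fences, adjacent values, outside and far out values."""
    x = sorted(batch)
    lower_hinge, upper_hinge = hinges(x)
    step = 1.5 * (upper_hinge - lower_hinge)                  # Step = 1.5 x Midspread
    inner = (lower_hinge - step, upper_hinge + step)          # inner fences (f)
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)  # outer fences (F)
    inside = [v for v in x if inner[0] <= v <= inner[1]]
    return {
        "step": step,
        "inner_fences": inner,
        "outer_fences": outer,
        "adjacent": (inside[0], inside[-1]),
        "outside": [v for v in x
                    if outer[0] <= v < inner[0] or inner[1] < v <= outer[1]],
        "far_out": [v for v in x if v < outer[0] or v > outer[1]],
    }


# Hypothetical batch with one outside value and one far out value.
summary = fence_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100])
for name, value in summary.items():
    print(name, value)
```

On a schematic plot the "outside" list would be drawn as circles and the "far_out" list as squares, with the whiskers ending at the two adjacent values.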


Expanded number summary and schematic plot constructed
on computer:

In  the  session  introducing  the  CMU-DAP  computer 
system. 
Topic 2.  Adequacy of these Summaries in Describing a Batch


I.  Basic Issue:  Once batch is condensed, how effectively do
the summaries describe it?

II.  Features of batch that must be included in condensation:

1.  Identification of typical value

2.  Determination of spread of batch

3.  Location  of  outliers 

4.  Maximum,  Minimum,  and  Range  of  batch 

(Create  3  or  4  examples  of  schematic  plots  from  data 
sets,  and  present  them  either  on  the  blackboard  or 
as  transparencies.    Discuss  the  appearance  of  each, 
indicating  how  the  above  necessary  features  are 
documented.) 
Lecture 1-2
Transparency  Presentation  Guide 


Lecture 
Outline 
Location 

Beginning 

Topic  1 
Section  IV 

1.a

1.d

1.d.iv

1.d.vii

1.d.ix
1.e

1.e.i
1.e.ii

1.f
1.g
1.g

2.a

2.e

2.f

3.a


Transparency 
—  Number 


(overlay  3) 
5 

2 

6 

7 
6 

8 
9 
10 

11 
12 

13 

14 


Transparency  Description 
Lecture  1-2  Outline 


5-number summary

Stem-and-leaf display with depths
5-number summary located


5-number  summary  for 
Pittsburgh  populations 


Letter-value  display 

Information  contained  in 
5-number  summary 

Midspread 

Information  contained  in 
5-number  summary 

Symmetric  Batch 

1970  Pittsburgh  Median  incomes 

5-number  summary  of  median 
incomes 


Simple  schematic  plot 

Schematic plot, 5-number
summary, stem-and-leaf

Schematic  plot  of  Pennsylvania 
Net  Migrations 

Schematic  plot  of  Pittsburgh 
Median  Incomes 
3.d.i  15  Numbers for Expanded Summary

3.d.ii  16  Anatomy of a Schematic Plot

3.d.vii  17  Tukey's Fenced-Letter display
Lecture 1-3.  Transformations for Symmetry


Transformations for Symmetry:  The Use of Various Algebraic
Transformations to Promote Symmetry in a Single Batch


Lecture  Content: 

1.  Discuss  different  types  of  data  and  the  need  for  transforma- 
tion 

2.  Introduce  methods  of  determining  a  good  transformation 


Main Topics:

1.  Units of measurement and different types of data
2.  Methods of determining a good transformation

Tools Introduced:

1.  Transformation Summaries

Topic 1.  Units of Measurement and Different Types of Data

I.  Basic issue:  Batch of data has a specific, but alterable,
unit of measurement

1.  Unit of Measurement

a.  Data always measured by some recording instrument,
and data values in batch are given in specific units

b.  Units may not be ideal for intended analysis

c.  May need to alter a unit of measurement by transforming (2)
the data, to obtain a better unit of analysis

2.     Chosen  unit  of  analysis  depends  on  the  type  of  data  to  be 
analyzed 

a.  Amounts--never negative, may be non-integer, e.g. (3)
height, weight, monetary units, distances, certain
ratios

b.  Counts--never negative, always integer valued,
e.g. numbers of persons, things, or events

c.  Percentages or numbers bounded on both extremes--take
values between a smallest possible number and a
largest possible number, e.g. percentage Black (between
0 and 100%), statistical correlations (between -1 and 1)

d.  Differences of amounts or counts ("balances")--positive
or negative, unbounded, e.g. profit (difference of
monetary amounts), net migration (difference of
counts of persons)

3.  Chosen transformation should make batch more symmetric
and, consequently, closer to being "well-behaved" and
easily summarized

II.  Problem:  Need simple rules for choosing a transformation

1.  Simple rules may not always be correct

2.  Best transformation depends on type of data to be
analyzed

3.  Unfortunately, even best transformation may fail to
increase symmetry

4.  Or, by increasing symmetry, transformation may increase
variation, or produce more outliers

III.  Solution:  "Correct" transformation depends on type of data
and on spread of data

1.  If ratio of maximum to minimum value is quite large
(magnitude of 2 or greater), then transformation is
essential.

2.  If ratio of maximum to minimum is small (less than 20),
then transformation will not change the appearance of
the batch.

3.  Correct transformations are "theoretically" correct, but
may fail in practice (3)

a.  Amounts and Counts (particularly large counts)--
Logarithms most useful; so are square roots

b.  Percentages and small counts--Special "arcsine"
transformation very useful

c.  Differences--Transform the counts or amounts whose
difference is under consideration

IV.  Examples

1.  Counted Data--Pittsburgh populations

a.  Take logarithms, base 10, of observations

b.  Logarithms have not made batch symmetric.  Batch is
asymmetric, trailing out to the right instead of to
the left (4)

c.  Try square roots of observations (5)

d.  Schematic plots show relationship between the raw
data and the transformations

2.  Percentage Data--Percent of individuals under poverty (6)
level in Pittsburgh

a.  Take Arcsine (Square Root (X)) for transformation
X = proportions, between 0 and 1

b.  Spread has decreased and symmetry improved with special
transformation (7)

c.  Schematic plot shows increased symmetry, although
outliers still present (8)
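A minimal sketch of the arcsine transformation used in this example, assuming the data arrive as percentages that must first be converted to proportions between 0 and 1. The function name and the sample percentages are illustrative and are not the Pittsburgh data.

```python
import math


def arcsine_sqrt(proportion):
    """Return Arcsine(Square Root(X)) in radians for a proportion X in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(proportion))


percentages = [2.0, 10.0, 50.0, 90.0, 98.0]   # hypothetical percentages
transformed = [arcsine_sqrt(p / 100.0) for p in percentages]
for p, t in zip(percentages, transformed):
    print(f"{p:5.1f}%  ->  {t:.3f}")
```

The transformation stretches values near the bounds 0 and 1 relative to values near the middle of the range, which is why it suits bounded data.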

3.  Amounts--Police expenditures in millions of dollars by
state, 1973 (9)

a.  Try square root and log transformation

b.  Logs are very effective

4.  Difference of Counts--Net migrations for Pennsylvania
counties, 1970-1974 (10)

a.  Stem-and-leaf shows symmetry but large outliers (11)

b.  Net migration = Change in Population - Number of
Births + Number of Deaths.  Transform these three
batches separately.

c.  Positive and negative values in the Change in (12)
Population batch make transformation impossible (13)
Topic 2.  Methods of Determining a Good Transformation

I.  Basic Issue:  Need a reliable method of finding a good
transformation

1.  Transformation must promote symmetry, and bring the outliers
of the batch toward the median

2.  Restrict ourselves to transformations from X to X^R for any
value of R

3.  This form of transformation includes logs (R = 0)

II.  Problem:  How do we find the correct exponent R?

III.  Solution:  Examine 5-number summary of raw and transformed
batch

1.  Correct transformation will have median halfway between
hinges and extremes

2.  Simple Ladder of Powers indicates that: (14)

a.  Increasing R expands the larger values of X

b.  Decreasing R compresses the larger values of X

3.  Ladder of Powers useful in conceptualizing how various
transformations act on batches

IV.  Method:  Transformation Summaries

1.  Example shows transformation summaries for the number of
births in Pennsylvania counties, 1970-1974 (15)

a.  Deaths take a similar transformation

b.  "Natural" increase in population = Births - Deaths
will also be symmetric with logs of births and deaths

2.  Features

a.  Useful if correct transformation for type of data
in batch does not promote symmetry

b.  Also useful if batch does not fall neatly into one of
the four types

c.  Easily computable from 5-number summary of raw batch

d.  Helps "zero in" on the appropriate exponent, R, for
transformation

3.  Analytic Qualities

a.  Midhinge and Midextreme indicate whether R should
be increased or decreased

b.  Correct R has median = midhinge = midextreme

c.  Upwards trend (M < midhinge < midextreme) indicates
R should be decreased

d.  Downwards trend (M > midhinge > midextreme) indicates
R should be increased

e.  Useful exponents:

i.  R = 1, Raw data

ii.  R = 2, Squared data

iii.  R = 1/2, Square roots

iv.  R = 0, Logarithms

v.  R = -1, Negative Reciprocals (change of sign
retains order).  Rarely will additional transformations
be needed

4.  Procedure

a.  Compute 5-number summary for batch

b.  Compute Midhinge (MidH) = 1/2 (UH + LH)
Midextreme (MidE) = 1/2 (Max + Min)

c.  Compare MidH, MidE, and Median (M)

d.  If M < MidH < MidE, decrease R
If M > MidH > MidE, increase R

e.  5-number summary for transformation of batch easily
found by raising 5-number summary of raw batch to the
correct exponent

f.  Continue search until M = MidH = MidE
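The procedure above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the CMU-DAP routine: all function names are hypothetical, the batch must be strictly positive, R = 0 is taken as the base-10 logarithm, and negative exponents are sign-flipped so that order is retained. As in step e, the summary of the transformed batch is obtained by transforming the raw 5-number summary, which is exact when each hinge is an actual data value and approximate when a hinge is the average of two values.

```python
import math
from statistics import median


def five_number_summary(batch):
    """Return (min, lower hinge, median, upper hinge, max)."""
    x = sorted(batch)
    n = len(x)
    half = (n + 1) // 2
    return x[0], median(x[:half]), median(x), median(x[n - half:]), x[-1]


def ladder(value, r):
    """Apply the ladder-of-powers transformation with exponent r."""
    if r == 0:
        return math.log10(value)
    if r < 0:
        return -(value ** r)   # change of sign retains order
    return value ** r


def trend(m, midh, mide, tol=1e-9):
    """Report which way R should move, following step d of the procedure."""
    if m < midh - tol and midh < mide - tol:
        return "decrease R"
    if m > midh + tol and midh > mide + tol:
        return "increase R"
    return "no clear trend"


def transformation_summary(batch, exponents=(2, 1, 0.5, 0, -1)):
    """Return rows of (R, median, midhinge, midextreme, advice)."""
    raw = five_number_summary(batch)
    rows = []
    for r in exponents:
        lo, lh, m, uh, hi = (ladder(v, r) for v in raw)
        midh, mide = (lh + uh) / 2, (lo + hi) / 2
        rows.append((r, m, midh, mide, trend(m, midh, mide)))
    return rows


# Hypothetical batch that is exactly symmetric on the log scale.
for row in transformation_summary([1, 10, 100, 1000, 10000]):
    print(row)
```

For this batch the raw data (R = 1) show an upward trend, so R is decreased, and the search stops at R = 0 where the median, midhinge, and midextreme coincide.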




5.  Transformation summaries constructed on computer:

a.  Use LET and REEX to transform batch

b.  Use SUMMARY and ESTATS to examine effect of
transformation

c.  Discovering the correct symmetrizing transformation
is an iterative process
Lecture 1-4.  Analysis of a Well-Behaved Batch


Analysis of a Well-Behaved Batch:  Presentation of a Special Type of
Batch and Examination of its Features

Lecture  Content: 

1.  Define  a  well-behaved  batch  and  discuss  its  characteristics 

2.  Introduce  measures  to  summarize  this  special  kind  of  batch 


Main Topics:

1.  Definition of a well-behaved batch

2.  Location and scale measures for a well-behaved batch


Tools Introduced:
1.  Mean

2.  Variance and Standard Deviation
Topic 1.  Definition of a Well-Behaved Batch

I.  Basic Issue:  Defining a well-behaved batch

1.  Well-behaved batches are theoretical entities and are
rarely observed empirically

2.  Many data analysts incorrectly believe that well-behaved
batches are common

3.  We discuss them because of their role in regression
analysis

4.  The well-behaved batch presented here was artificially
constructed to facilitate the introduction of the
definition

II.  Definition (2)

1.  Well-behaved batch is: (3)

a.  Symmetric:  MidH = MidE = M (4)

b.  Devoid of outliers

2.  For a well-behaved "standard" batch with M = 0 and (5)
Midspread = 1.36:

a.  50% of batch > 0; 50% < 0

b.  50% of batch is between -0.68 and 0.68

c.  80% of batch is between -1.29 and 1.29

d.  90% of batch is between -1.65 and 1.65

e.  95% of batch is between -1.96 and 1.96

f.  Extremes are approximately -2.60 and 2.60, but may
be larger

3.  Well-behaved batch has shape that resembles (in theory) (6)
a Gaussian (or "normal") function
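The percentages quoted for the standard well-behaved batch can be checked against the Gaussian curve it is said to resemble. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `central_coverage` is an assumption of this example.

```python
from statistics import NormalDist

STANDARD = NormalDist(mu=0.0, sigma=1.0)


def central_coverage(z):
    """Fraction of a standard Gaussian lying between -z and +z."""
    return STANDARD.cdf(z) - STANDARD.cdf(-z)


for z in (0.68, 1.29, 1.65, 1.96):
    print(f"between -{z:.2f} and {z:.2f}: {100 * central_coverage(z):.1f}%")
```

The four intervals cover roughly 50%, 80%, 90%, and 95% of the curve respectively, matching items b through e.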
Topic 2.  Location and Scale Measures for a Well-Behaved Batch

I.  Basic Issue:  Need for special summarization tools for a
well-behaved batch

1.  All well-behaved batches have similar appearance

2.  Two well-behaved batches may differ only in:

a.  Location--where batches are positioned along the real
number line

b.  Scale--how spread out the batches are, amount of
variation in the data values

3.  Need to quantify these concepts to facilitate comparison
of well-behaved batches

II.  Problem:  Which location and scale measures are appropriate?

1.  The median of a batch is a measure of location, as is the (7)
mode of a batch (data value with greatest frequency of
occurrence) and the arithmetic average, or mean, of the
batch

2.  The midspread and range are measures of spread.  The (8)
variance, or average of the squared differences from
the mean, also measures spread

3.  The standard deviation, or square root of the variance of
a batch, in the same unit as the data values, is also
useful in measuring the scale of the batch

III.  Solution:  Mean and standard deviation are the correct
measures of location and scale, respectively

1.  In a well-behaved batch, X-bar, the mean, and M, the median, are (7)
equal to each other and to the mode

2.  In a well-behaved batch, the standard deviation, s, is (8)
approximately equal to 3/4 x Midspread

IV.  Methods:  Mean and Standard Deviation

1.  Example shows mean and standard deviation of our (9)
hypothetical well-behaved batch
2.  A well-behaved batch with X-bar = 0 and s = 1 is called a
standard or standardized well-behaved batch

3.  Features

a.  Mean and standard deviation are sufficient to describe
a well-behaved batch (explain statistical
sufficiency)

b.  Any well-behaved batch may be standardized by subtracting
the mean from each data value and dividing
the remainder by s, (X - X-bar)/s

c.  Mean and standard deviation are not sufficient to
describe batches that are not well-behaved

4.  Analytic Qualities

a.  Median and midspread are more resistant, or less
affected by outliers, than mean and standard
deviation

b.  Nonetheless, X-bar and s are classical measures of
location and spread (for all batches)

5.  Procedures

a.  $\bar{X} = \frac{1}{N} \sum_i X_i$

b.  $s = \sqrt{\frac{1}{N} \sum_i (X_i - \bar{X})^2}$

6.  Another example:  I.Q. scores for 100 16-year-old females (10)

a.  Batch is well-behaved: (11)
i.  Symmetric
ii.  No outliers
iii.  X-bar = M = 101
iv.  s = 3/4 x Midspread = 12

b.  Standardize batch.  Note resemblance to hypothetical
standard well-behaved batch (12)

(13)

(This  lecture  should  be  followed  by  a  review  of  the 
entire  unit  before  the  quiz  is  given.) 
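The mean, the standard deviation with divisor N, and the standardization (X - X-bar)/s from this lecture can be sketched with Python's standard library. The function name and the sample batch are illustrative assumptions; `pstdev` is used because it divides by N, as in the procedure above, whereas `stdev` divides by N - 1.

```python
from statistics import fmean, pstdev


def standardize(batch):
    """Subtract the mean from each value and divide the remainder by s."""
    mean = fmean(batch)
    s = pstdev(batch)          # square root of (1/N) * sum of squared deviations
    return [(x - mean) / s for x in batch]


batch = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical batch
print("mean =", fmean(batch), " s =", pstdev(batch))

z = standardize(batch)
print("standardized mean =", round(fmean(z), 6), " standardized s =", round(pstdev(z), 6))
```

After standardization the batch has mean 0 and standard deviation 1, so it can be compared directly with the hypothetical standard well-behaved batch.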
Homework Problems
Unit 1


1.  A municipality is trying to decide between building its own
steam-electric generating plant or purchasing power from a private
supplier.  Data exist on the installed generating capacity of 33
plants in municipalities with similar socioeconomic and demographic
characteristics.  Installed generating capacity is a measure of
the size of a plant.  A first step in the decision process involves
examining the range of plant sizes.  Sort the data on installed
generating capacity; then make a histogram.

Do the data cluster or are they uniformly spread out?  Are
the data symmetrical?  Are there any outliers?  If there are
clusters or outliers, where do they occur?  What can the
municipality infer from the sort and histogram?


Installed Generating Capacity in Megawatts


Plant                  Capacity

Bull Run                950
Barry                  1770.8
Colbert "A"             845.6
Canal                   542.5
Colbert "B"             550
Etiwanda               1069.1
Gallatin               1255.2
Astoria                1550.6
Johnsonville "A"       1485.2
Ravenswood             1827.7
Johnsonville "B"        691.2
Conemaugh              1872
Kingston               1700
Kyger Creek            1086.3
Paradise "A"           1408
Keystone               1872
Paradise "B"           1150.2
Elrama                  510.3
John Sevier             823.3
Mt. Storm              1140.5
Shawnee                1750
Joppa                  1100.3
Widows Creek "A"        853
Four Corners           1636.2
Widows Creek "B"       1125
Fort Martin            1152
Big Sandy              1096.8
Wabash River            908
Cane Run               1016.7
Parish                 1255.4
Clifty Creek           1303.6
Sam Bertron             826.3
Gannon                 1270.4


2.  Below is a histogram of the 1973 population of the U.S. for the
fifty states and the District of Columbia and the data
from which the histogram was composed.  What is the interval of
population size into which the largest number of states fall?

What is the number of states in that interval?  Which states are
the outliers of this batch?  How would a logarithmic transformation
of this batch affect the display?  Data are on the next
page.
State             1973 Population
                  (in thousands)

Maine                  1,028
New Hampshire            791
Vermont                  464
Massachusetts          5,818
Rhode Island             973
Connecticut            3,076
New York              18,265
New Jersey             7,361
Pennsylvania          11,902
Ohio                  10,731
Indiana                5,316
Illinois              11,236
Michigan               9,044
Wisconsin              4,569
Minnesota              3,897
Iowa                   2,904
Missouri               4,757
N. Dakota                640
S. Dakota                685
Nebraska               1,542
Kansas                 2,279
Delaware                 576
Maryland               4,076
D.C.                     746
Virginia               4,811
W. Virginia            1,794
N. Carolina            5,273
S. Carolina            2,726
Georgia                4,786
Florida                7,678
Kentucky               3,342
Tennessee              4,126
Alabama                3,539
Mississippi            2,281
Arkansas               2,037
Louisiana              3,764
Oklahoma               2,663
Texas                 11,794
Montana                  721
Idaho                    770
Wyoming                  353
Colorado               2,437
New Mexico             1,106
Arizona                2,058
Utah                   1,157
Nevada                   548
Washington             3,429
Oregon                 2,225
California            20,601
Alaska                   330
Hawaii                   852

3.  Below are the test scores of fifty fifth grade students.  Make
a stem-and-leaf and a schematic plot of this batch.  What
are the mean and standard deviation of this batch?  How
well-behaved is this batch?  What is the median of the batch?  How
does it compare to the mean?

72  112  56  104 

67  135  97  66 

76  102  97  78 

77  93  63  82 
92  87  53  81 
85  81  112  96 
79  106  100  72 
65  71  49  83 
75  82  77  67 
83  112  93  78 
89  102  86  99 

102  96  90  105 

118  80 
4.  a.  As a member of the mayor's task force on residential
integration you have been asked to make a study of the
distribution of nonwhites in Omaha, Nebraska.  The data
below are from the 1970 U.S. Census of Population.  They
give the percent of the population of each census tract
in Omaha that is nonwhite.  Put them in the form of a
stem-and-leaf display.


0.253 
27.276 
52.831 
6.941 
6.121 
6.119 
7.27E-02
1.056 
0.206 
17.297 
15.269 
0.326 
5.96E-02 
7.05E-02 
0.239 
0.262 


0.666 
21.528 
35.694 
1.329 
0.233 

6.90E-02 
0.110 
0.475 
0.145 
5.663 
20.628 
0.288 
0.446 
0.118 
0.128 

5.44E-02 


3.811 
43.338 
98.315 
0.357 
0.593 
0.777 
8.64E-02 
0.369 
0.563 
9.35E-02 
1.875 
0.129 
5.94E-02 
0.114 
14.012 
11.111 


0.559 
44.603 
69.719 
1.095 
0.591 
0.322 
0.421 
0.363 
0.174 

5.58E-02 
1.507 

9.57E-02

7.41E-02
1.313 
0.232 

2.37E-02 


2.306 
37.037 
1.597 
0.511 
0.689 
0.152 
0.544 
0.153 
7.943 

5.33E-02 
1.336 
0.296

2.57E-02 
1.473 
0.153 

6.79E-02 


10.188 
42.392 
6.769 
0.247 
4.660 
3.39E-02 
0.117 
0.132 
23.754 
0.605 
1.083 
4.82E-02
3.39E-02 
5.05E-02 
0.207 
5.29E-02 


b.  Prepare a five number summary of these data and present it
as a letter-value display.

c.  Present a fenced letter display.

d.  Prepare a schematic plot of these data.


e.  The members of the mayor's task force are unfamiliar with
stem-and-leaf display.  Put the data into the form of a
histogram.  What information has been lost in going from
one to the other?




5.  The figure below is a schematic plot of the percent of families
in Omaha census tracts with incomes below the poverty level
in 1970.  Label the different kinds of outliers, the hinges,
and the median and indicate the values that these points
correspond to.


[Figure: schematic plot of the percent of families below the
poverty level in Omaha census tracts, 1970]
6.  Test scores on a group of children (age 10) from the same
neighborhood were as follows:

2.95, 3.22, 3.32, 3.40, 3.59, 3.73, 3.80

To study the effect on various summaries of a change in one
value in a batch, vary the value shown as 3.22.  Examine
the effects on the mean, median, standard deviation, and
S = 3/4 x midspread, as the moving value goes from below
2.95 to above 3.80.  Use intervals of .2 (i.e., moving
value first equals 2.90, then 3.10, then 3.30, etc.)  Also
move the value to 4.90.


7.  (a)  Prepare two different stem-and-leaf displays of the welfare
data by stretching or squeezing the stem.  Which do
you think is preferable?  Why?  Do they both give the
same information about the batch?


(b)  What do you infer from your analysis about the cost of
welfare per inhabitant?  Pay particular attention to
outliers.

(c)  Summarize the data in a letter-value display.  Now exclude
outliers and present in a letter-value display.  Comment
on the differences.

Data are on the next page.
1972 Cost of Welfare per Inhabitant by State


Alabama                 $ 46.43
Alaska                    50.09
Arizona                   27.05
Arkansas                  49.70
California                97.30
Colorado                  52.48
Connecticut               38.03
Delaware                  36.56
District of Columbia     100.44
Florida                   23.52
Georgia                   44.55
Hawaii                    52.21
Idaho                     30.40
Illinois                  55.43
Indiana                   23.53
Iowa                      33.51
Kansas                    30.30
Kentucky                  35.92
Louisiana                 51.36
Maine                     51.61
Maryland                  35.64
Massachusetts             71.23
Michigan                  56.72
Minnesota                 45.89
Mississippi               48.22
Missouri                  40.43
Montana                   24.24
Nebraska                  27.76
Nevada                    20.36
New Hampshire             32.73
New Jersey                50.27
New Mexico                35.98
New York                  89.37
North Carolina            25.59
North Dakota              29.45
Ohio                      30.85
Oklahoma                  54.90
Oregon                    35.94
Pennsylvania              50.55
Rhode Island              50.87
South Carolina            16.35
South Dakota              28.75
Tennessee                 33.51
Texas                     33.61
Utah                      33.35
Vermont                   54.65
Virginia                  25.67
Washington                48.16
West Virginia             34.01
Wisconsin                 32.39
Wyoming                   18.50

(Source:  1975 World Almanac, Page 157)
8.  A recent study of career choice listed the percentage of
doctorate-holders who held a job in the same field as their
doctorates.  Prepare a stem-and-leaf display of the results.


Do the data cluster or are they uniformly spread out?  Are
the data symmetrical?  Are there any outliers?  If there are
any clusters or outliers, where do they occur?  What can you
infer about career choice from your analysis?

Mathematics  91% 

Physics,  Astronomy  90% 

Chemistry  84% 

Earth  Sciences  93% 

Engineering  92% 

Agriculture,  Forestry  73% 

Health  Sciences  78% 

Biochemistry,  Physiology, 

Biostatistics  70% 

Anatomy ^_Cyt61pgy,  Gene- 

tics>  Ehtbmblogy  47% 

Botany^  General  Biology, 

Botany  51% 

Anthropology y  Archaeo- 


logy 
Sociology 

Economics ,  Econometrics 
Political  Science,  In'ter- 


76% 


79% 


national  Relations 
History 

i«aEiguage,  Literature 
Fhilbsbphy,  Arts 
Business,  theology 
Education 
Psychology 


90% 


81% 


73% 


70% 


85% 


83% 


81% 


IBs 
9.  Identify the following batches as bounded numbers, amounts,
counts, or differences.


(a)  The average hourly earnings in manufacturing industries were:

1950    $1.44
1955     1.86
1960     2.26
1965     2.61
1970     3.36


(b)  Grain receipts at western Canadian grain centers in 1972-73 (in
Thousands of Bushels):

Wheat  633,258 

Oats  32,484 

Barley  236,816 

Rye  9,252 

Flaxseed  18,346 

Rapeseed  62,949 

(c)  Unemployment  rate  for  Americans  aged  16  and  over; 

Spanish  7.5 
White  4.3 
Black  9.3 

(d)  Indians  in  North  Dakota,  1970: 

Apache  9  Kaw, Omaha  33

Cherokee        50  Lumbee  33 

Chippewa  6,721  Shoshone  11 

Creek  18  Sioux  3,655 

Iroquois        45  Other  1,629 
(e)  Performances of record long run Broadway plays:


Fiddler on the Roof     3,242
Life with Father        3,213
Tobacco Road            3,182
Hello Dolly             2,844
My Fair Lady            2,717
Man of La Mancha        2,328


(f)  Change in population in major Alaskan cities between 1960 and
1970 census:


Anchorage      3,844
Fairbanks      1,460
Juneau         - 747
Ketchikan        511
Spenard        9,015

(g)  Sales of recreational vehicles, 1973:

Travel trailers      212,300 units
Motor homes          129,000 units
Truck campers         89,800 units
Camping trailers      97,700 units
Pickup covers        223,700 units


(h)  Percent of high school seniors with no college or vocational school
plans, by family income (1974)


Under $5,000          27.1
$5,000 - $7,499       23.5
$7,500 - $9,999       21.6
$10,000-$14,999       19.7
$15,000-$24,999       15.3
$25,000 and over       6.9
(i)  Distance from home to college for first-time students in 4-year
colleges, 1973:


Distance
in Miles          Distribution


10 or less            15.8
11-50                 19.9
51-100                15.9
101-500               35.6
more than 500         12.8


(j)    Average  raise  received  by  instructional  staff  in  universities  at 
beginning  of  1975-76  school  year: 


Professor  $1,768 

Associate  Professor  1,045 

Assistant  Professor  848 

Instructor  857 

(k)  Distribution and frequency of low-income families, by place of
residence


                   Number in
Residence          Group (Millions)

Urban                  27.5
Rural non-farm         11.4
Rural farm              4.8

(l)  U. S. shoreline, in statute miles:

Atlantic coast  28,673

Gulf coast  17,141

Pacific coast  40,298

Arctic  coast  (Alaska)  88,633 
10.  Below is a list of food indexes for major U.S. cities in July,
1974.  Prepare a stem-and-leaf display and five number summary.
How does this batch compare to the hypothetical well-behaved
batch?


Atlanta              162.7
Milwaukee            154.8
Baltimore            163.1
Minneapolis          162.9
Boston               161.6
New York             165.0
Buffalo              159.9
Philadelphia         164.5
Chicago              160.4
Pittsburgh           162.9
Cincinnati           163.2
Portland             154.8
Cleveland            159.2
St. Louis            157.6
Dallas               155.7
San Diego            159.2
Detroit              162.6
San Francisco        154.8
Honolulu             156.9
Scranton             159.3
Houston              162.7
Seattle              155.3
Kansas City, MO      160.7
Washington, D.C.     164.4
Los Angeles          155.5
11.  Population densities by state are skewed towards low density.
Select an appropriate transformation for symmetry, based on
the summary numbers.  Do the transformation and present the
results in a stem-and-leaf display.  Discuss clustering, outliers
and symmetry.

Population  Density  by  State,  1970 
(people  per  square  mile) 


D  / .  y 

Montana 

4.8 

A T  a  Qtr  A 

^  B  9 IVCI 

Nebraska 

19.4 

13.  0 

Nevada 

4.4 

A            n  c»  a  0 

/^FKsnsas 

37 . 0 

New  Hampshire 

81.7 

ua  IX  romia 

127 .  0 

New  Jersey 

953.1 

wox  oraoo 

21,3 

New  Mexico 

8.4 

Connecticut 

623,7 

New  York 

381.3 

pe  lawa  re 

276.  5 

North  Carolina 

164.1 

district  ox  LolumDia 

to    /  rtf  D 

12 ,h01 ; 8 

North  Dakota 

8.9 

1  ^->.  5 

Ohio 

260.0 

Georgia 

79iO 

Okianoma 

37.2 

119.8 

Oregon 

21.7 

Idaho 

8.6 

Pennsylvania 

262.3 

Illinois 

199.4 

Rhode  Island 

905.5 

Indiana 

143.9 

South  Carolina 

85.7 

Iowa 

50.5 

South  Dakota 

8.8 

Kansas 

27.5 

Tennessee 

94.9 

Kentucky 

81.2 

Texas 

42.7 

Louisiana 

81.0 

Utah 

12.9 

Maine 

32.1 

Vermont 

47.9 

Maryland 

396.6 

Virginia 

116.9 

Massachusetts 

727.0 

Washington 

51.2 

Michigan 

156.2 

West  Virginia 

72.5 

Minnesota 

48.0 

Wisconsin 

81.1 

Mississippi 

46.9 

Wyoming 

3.4 

Missouri 

67.8 

(World  Almanac,  p.  154) 
Homework Solutions
Unit 1


1.  Installed Generating Capacity in Megawatts (Sorted)


 510.3        1140.5
 542.5        1150.2
 550          1152
 691.2        1255.2
 823.3        1255.4
 826.3        1270.4
 846.6        1303.6
 853          1408
 908          1485.2
 950          1550.6
1016.7        1636.2
1069.1        1700
1086.3        1750
1096.8        1770.8
1100.3        1827.7
1125          1872
              1872

The data cluster between 800 and 1300, so they are not
uniformly spread out.  The batch is roughly symmetrical and has
no outliers.  The municipality is interested in noting that
plant sizes in similar municipalities range from 500-1900 megawatts
installed generating capacity.  Central values of that range are
observed more often than the extremes and a typical value (the
median) is 1140.5 megawatts.
2.  a)  The interval of population size into which the largest number
of states falls is zero to one million.

b)  The number of states in that interval is thirteen.

c)  California and New York are the outliers of this batch.

d)  A logarithmic transformation of this batch would promote
symmetry by compressing the larger values in the batch
while stretching out the smaller values.
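The compressing and stretching described in part d) can be illustrated with the two smallest and two largest values from the population table (in thousands). This is only a sketch: the variable names are hypothetical and base-10 logarithms are assumed.

```python
import math

# Smallest and largest 1973 populations from the table, in thousands.
alaska, wyoming = 330, 353
new_york, california = 18265, 20601

# Gaps between neighbouring values at each end of the raw batch.
raw_low_gap = wyoming - alaska
raw_high_gap = california - new_york

# The same gaps after taking logarithms, base 10.
log_low_gap = math.log10(wyoming) - math.log10(alaska)
log_high_gap = math.log10(california) - math.log10(new_york)

raw_ratio = raw_high_gap / raw_low_gap
log_ratio = log_high_gap / log_low_gap

print(f"raw:  high gap is {raw_ratio:.0f} times the low gap")
print(f"logs: high gap is {log_ratio:.1f} times the low gap")
```

On the raw scale the gap at the top of the batch dwarfs the gap at the bottom; after logarithms the two gaps are of comparable size, which is what pulls the long right tail in.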


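3.
A.  Stem-and-leaf display (UNIT = 1)

 4 | 9
 5 | 3 6
 6 | 3 5 6 7 7
 7 | 1 2 2 5 6 7 7 8 8 9
 8 | 0 1 1 2 2 3 3 5 6 7 9
 9 | 0 2 3 3 6 6 7 7 9
10 | 0 2 2 2 4 5 6
11 | 2 2 2 8
12 |
13 | 5

[Schematic plot: vertical axis marked at 50, 70, 90, 110, and 130]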


B.  STD. DEV. = 17.5, where N = 50 and X-bar = mean


C.  MEDIAN = 84 (depth = 25h)

This batch is roughly symmetric and Gaussian in shape.  The
median and mean are approximately equal.  Midhinge = 87.5;
Midextreme = 92; these values are also close to the median, but
there is clearly an upward trend.  There is one outside value,
which would not appear in a well-behaved batch.  However,
for real data, this batch comes remarkably close to being well
behaved.  Notice also that 3/4 x midspread = 17.25, which is
close to the standard deviation.
4.  (A)  Unit = .01%


0* 

022334555555566777899 

t 

11111222345557 

2 

3033345689 

3* 

22566 

4 

247 

5 

145699 

6* 

38 

7 

7 

8 

9* 

Unit = .1%


0003334558 

3 

8 

6 

6 

79 
9 


0"' 
1 

2 
3 
4 
5 
6 
7 
8 
9 

Unit  =  1% 

1  D1457 

2  3137 

3  ■  57 

5 
6 

7*** 
8 
9 

(B)  #96  % pop. nonwhite in Omaha census tracts

     M  48h          .355
     H  25       .11      1.8
        1        0        98
4.  (6)  #96  % nonwhite pop. in Omaha census tracts

     M  48h          .355
     H  25       .11      1.8       1.69
        1        0        98

     f         -2.43      4.34
                XXX       three
     F         -4.97      6.88
                XXX       nineteen

     ADJ   0, 3.8
     OUT   4.6, 5.6, 6.7
     FAR   6.9, 7.9, 10, 11, 14, 15, 17, 20, 21,
           23, 21, 35, 37, 42, 43, 44, 52, 69, 98
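
The fences in this display follow from the hinges alone. As a hedged sketch (the function names `fences` and `classify` are illustrative, not part of the course software), the step is 1.5 times the midspread, the inner fences lie one step beyond the hinges, and the outer fences two steps beyond. The printed values above differ from the computed ones in the last digit, apparently because the step was rounded to 2.54 first.

```python
def fences(lower_hinge, upper_hinge):
    """Return (step, inner fences, outer fences) computed from the two hinges."""
    midspread = upper_hinge - lower_hinge
    step = 1.5 * midspread
    inner = (lower_hinge - step, upper_hinge + step)
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)
    return step, inner, outer

def classify(value, inner, outer):
    """Label a value as 'inside', 'outside', or 'far out' relative to the fences."""
    if value < outer[0] or value > outer[1]:
        return "far out"
    if value < inner[0] or value > inner[1]:
        return "outside"
    return "inside"

# Hinges of the Omaha % nonwhite batch, taken from the summary above.
step, inner, outer = fences(0.11, 1.8)
for v in (3.8, 4.6, 6.7, 6.9, 98):
    print(v, classify(v, inner, outer))
```

With these fences, 3.8 is the largest value still inside, 4.6 through 6.7 are outside, and 6.9 upward are far out, matching the ADJ, OUT, and FAR rows.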


(E)  Although the histogram shows that almost 80% of the census
tracts' populations are less than 10% nonwhite, you lose the
information that 21 tracts are less than .1% nonwhite, that 62
tracts have less than 1% nonwhite, and so on.  You also lose
the specific values.

In short, you have lost a lot of the detailed information.
Moving                     Std. dev.   Std. dev.
value    Mean    Median    (n-1)       (n)         Midspread

2.9      3.38    3.40      .35         .33         [illegible]
3.1      3.41    3.40      .32         .29         .34
3.22     3.43    3.40      .30         .28         .29
3.3      3.44    3.40      .29         .27         .26
3.5      3.47    3.50      .29         .26         .23
3.7      3.50    3.59      .30         .28         .26
3.9      3.53    3.59      .33         .30         .30
4.9      3.67    3.59      .61         .57         [illegible]



While the mean varies with every shift of the moving value,
the median jumps twice, but remains constant for any extreme
magnitude of the moving value.  Similarly, while the standard
deviation moves with every shift of the moving value, the midspread
does shift, but remains constant for any extreme magnitude of the
moving value.  What is demonstrated is the resistance of the median
and the midspread to extreme values of the batch.
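
The same resistance can be reproduced with a few lines of code. This is a sketch under stated assumptions: the fixed values are hypothetical (the exercise's actual batch is not reproduced here), the standard deviation uses the n-1 divisor, and hinges are taken as the medians of the lower and upper halves of the sorted batch.

```python
import statistics

# Hypothetical fixed part of the batch; one extra "moving" value is appended.
fixed = [3.0, 3.2, 3.3, 3.4, 3.6, 3.7, 3.8]

def hinges(values):
    """Hinges: medians of the lower and upper halves (middle value shared if n is odd)."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

def summarize(moving):
    """Location and spread measures for the fixed values plus one moving value."""
    batch = fixed + [moving]
    lower, upper = hinges(batch)
    return {
        "mean": statistics.mean(batch),
        "median": statistics.median(batch),
        "stdev": statistics.stdev(batch),
        "midspread": upper - lower,
    }

# Push the moving value further and further out.
for moving in (4.9, 49.0, 490.0):
    s = summarize(moving)
    print(moving, {name: round(value, 2) for name, value in s.items()})
```

Once the moving value is the largest in the batch, making it more extreme changes the mean and standard deviation but leaves the median and midspread where they were.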
(a)  Two likely stem-and-leaf displays are:

1972  Cost  of  Welfare  per  inhabitant  by  State 


unit  =  $1 


unit  =  $1 


1 

8 

6 

• 

68 

1 

8 

2 

3340 

2 

0 

• 

775985 

t 

33 

3 

0302033342 

f 

554 

865555 

s 

77 

040 

9* 

QQ 

# 

9588 

3 

000 

5 

0221104004 

t 

323332 

• 

56 

f 

45555 

6 

s 

6 

• 

3» 

8 

7 

1 

4 

00 

• 

t 

8 

f 

45 

# 

9 

s 

9 

4» 

988 

• 

7 

5 

011000 

10 

0 

t 

22 

f 

445 

-  s  . 

6 

5* 

HI:  1 

71,  89 

There is no one preferable scale; as long as you can defend it,
you may select any scale.  An argument could be made in favor of
the scale on the right (the stretched scale) because it gives a
better idea of the shape.

Both scales give the same information about the batch; one
might argue that it is easier to read the information from the
stem-and-leaf on the right, or that the scale on the left emphasizes
the cluster and the outliers.

(b)  In 1972, the cost of welfare per inhabitant ranged from $16.35
in South Carolina to $100.44 in the District of Columbia.  There
is a cluster of values around $30-35 and a smaller cluster at
$48-52.  There are four outliers--Massachusetts, New York,
California, and the District of Columbia--all of which were high.
The outliers are all states with large metropolitan areas.  D.C.,
with the highest cost per inhabitant, is exclusively urban.  The
lowest costs are associated with rural states: S. Carolina, Wyoming,
Nevada.  Thus high welfare costs per inhabitant are associated with
urban areas.
7.  (e)

          #51                          #47

    M  26        35              M  24         35
    H  13h    30     50          H  12h    29.5    49.5
    E  1      16     100         E  1      16      56


When high outliers are excluded, the only number of the five
number summary with a major change is the maximum.  Hinges change
very little.  The batch is much more symmetric with outliers
excluded.
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8. 


LO 
7 

_  • 

8 
9 


47,51 

3003 
896 
4131 
5 

10320 

1  value  missing 


unit  -  1% 


The data do not cluster, but are not quite uniform either, due
to the gap at 86-89 followed by several values at 90-93.  However,
by contrast with bell-shaped and skewed batches, this one may be
considered uniform.  By the same reasoning, the data are relatively
symmetric.  There are two low outliers: 47 and 51.  Roughly 79-93%
of doctorate-holders in various fields have jobs in the same field.
Biological sciences are an exception, where fewer doctorate-holders
are employed in their field.


9. 


a) 
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b) 

amount 

c) 

bounded numbers

d) 
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e) 
f)

difference

g)
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h) 

bounded  numbers 

i) 

bounded  numbers 
10.  Two likely stem-and-leaf displays are:
unit = 1


and 


1  5  f 

5 

5 
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1  5 

9 

9 
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9 

1  6 
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fe 
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3 
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unit  = 

.1 

15  4 

8 

8 

8 

5 

7 

5 

3 

6 

9 

7 

6 

8 

15  9 

9 

2 

2 

3 
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4 

7 

1 

6 

2 

7 

6 

7 

9 

3 

1 

2 

6 

5 

4 

16  5 

0 

4  5 


2  2  2 


The five number summary is:

    n = 25

    M  13         160.4
    H  7      156.9    162.9
    E  1      154.8    165.0


To compare with the hypothetical well-behaved batch, calculate
the midhinge and midextreme.

    Midhinge   = (156.9 + 162.9) / 2 = 159.9

    Midextreme = (154.8 + 165.0) / 2 = 159.9


This batch appears well-behaved in that its median, midhinge, and
midextreme have approximately the same value.  However, it is
clear from either stem-and-leaf display that the batch is more
uniform than bell shaped; therefore, it is not an example of a
well-behaved batch.
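
The location check used in this solution is simple arithmetic on the five number summary. A minimal sketch follows (the function name `midsummaries` is illustrative). As the solution stresses, agreement of these three values is necessary but not sufficient for a well-behaved batch; shape still has to be judged from the display.

```python
def midsummaries(minimum, lower_hinge, median, upper_hinge, maximum):
    """Return (median, midhinge, midextreme) from a five number summary."""
    midhinge = (lower_hinge + upper_hinge) / 2
    midextreme = (minimum + maximum) / 2
    return median, midhinge, midextreme

# Five number summary from problem 10.
median, midhinge, midextreme = midsummaries(154.8, 156.9, 160.4, 162.9, 165.0)
print(f"median {median:.1f}, midhinge {midhinge:.1f}, midextreme {midextreme:.1f}")
```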




11)    After  a  logarithmic  transformation,  the  data  are: 


unit  *  .1 


-  0  . 

+  e  . 

2  . 

3  . 


5  6  6  9  9  9  9 
1  1  2  3  3  4 

5556666778889999999 

00011113444 

5  6  7  8  9  9 


These data now more closely approximate the well-behaved
batch.  They are roughly symmetric, with a cluster around the
center (median 1.8).  There are two outliers: one high at 4.0 (D.C.)
and one low at -.3 (Alaska).
Unit 1 Quiz

I.  Answer the following questions briefly and generally.

1.  What is a batch?

2.  How are the median and mean affected by deviant values
in a batch?

3.  What is the midspread?

4.  What is the midhinge?

5.  What techniques are there for condensing a batch?

6.  What are the possible advantages of condensing a batch?

7.  What are the most common transformations?

8.  What is the simple ladder of powers?

9.  What are the possible advantages of transforming single
batches?

10.  What is a well-behaved batch?

11.  How may two well-behaved batches differ?

12.  How is a well-behaved batch standardized?
II.  Below is a list of the infant mortality rates (deaths per 1000
live births) for the sixteen Eastern Montana counties.


Carter        28.4        Powder River    15.0
Custer        14.2        Prairie         15.7
Daniels       26.3        Richland        16.6
Dawson        17.2        Roosevelt       42.2
Fallon        21.1        Rosebud         40.8
Garfield      12.1        Sheridan        19.7
McCone        24.2        Valley          26.2
Phillips      28.1        Wibaux          27.2

Do the following with the data:

1.  Sort the batch.

2.  Prepare a stem-and-leaf display.

3.  Make a five-number summary.

4.  Make a schematic plot.

5.  Discuss the data, based on your work in parts 1-4.


(By now you should know what questions to ask of a batch.)
Unit 1 Quiz
Solutions

I.  1.  A batch is a set of similar numbers obtained in some
consistent fashion.

2.  The median is affected very little by deviant values.  Extremely
large values may increase the mean a lot, while extremely
small values may greatly lower the mean.

3.  The midspread is the distance between the hinges (upper hinge
- lower hinge).  It is a measure of spread.

4.  The midhinge is the value halfway between the hinges
((upper hinge + lower hinge) / 2).  It is a measure of central location.

5.  Schematic plots, expanded schematic plots, five number
summaries, and expanded number summaries are techniques for
condensing a batch.

6.  The purpose of condensing a batch is to summarize it by
describing a typical value and variation of the values,
and identifying outliers.

7.  Common transformations are of the form x -> x^k, where k = -1,
0, 1/2, 2 (that is, negative reciprocals, logarithms, square
roots, and squares).  The arcsin of the square root of x
is another common transformation.

8.  The simple ladder of powers is a plot of x against
transformations of x showing the direction and to some
extent the rapidity with which the transformation changes
the batch.

9.  Transformations reexpress the batch in units that are desirable
for intended analysis.  This usually means increasing
symmetry; reducing outliers and variance is also desirable.

10.  A well-behaved batch has median = mean = midhinge = midextreme,
has s = 3/4 x midspread, has no outliers, and resembles a Gaussian
function in shape.

11.  Two well-behaved batches may differ only in location and
scale.

12.  To standardize a well-behaved batch, subtract the mean from
each value and divide by the standard deviation.
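
Answer 12 translates directly into code. This is a minimal sketch with a hypothetical batch; it assumes the standard deviation with the n-1 divisor (`statistics.stdev`), since the solutions do not say which divisor is intended.

```python
import statistics

def standardize(batch):
    """Subtract the mean from each value and divide by the standard deviation."""
    mean = statistics.mean(batch)
    sd = statistics.stdev(batch)  # n-1 divisor; an assumption, see lead-in
    return [(x - mean) / sd for x in batch]

# Hypothetical batch, for illustration only.
batch = [61, 70, 75, 80, 84, 88, 93, 99, 107]
z = standardize(batch)
print([round(value, 2) for value in z])
```

After standardizing, the batch has mean 0 and standard deviation 1, so two well-behaved batches that differed only in location and scale become directly comparable.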
II.

1.  12.1      24.2
    14.2      26.2
    15.0      26.3
    15.7      27.2
    16.6      28.1
    17.2      28.4
    19.7      40.8
    21.1      42.2

2.  Unit = 1 death per 1000 live births

    1* | 24
    1. | 55679
    2* | 14
    2. | 66788
    HI | 40.8, 42.2

3.  #16

    M  8h          22.6
    H  4h      16.1     27.6       midspread 11.5
    E  1       12.1     42.2

    optional:

    step  17.3
    f          -1.2     44.9
               ***      ***
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12.1  42.2 
Infant mortality rates range from 12.1 to 42.2 deaths per 1000
live births in Eastern Montana.  Although the step is large
enough that there are no outside values, there are clearly
two high outliers.  It is worth looking more deeply into the
populations and standards of living in Roosevelt and Rosebud
counties.  The data do not particularly cluster, and a typical
infant mortality rate is 22.6.
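
The five-number summary and fences in this solution can be recomputed from the sixteen rates listed in Part II. The sketch below is illustrative (the function name is invented) and takes hinges as the medians of the lower and upper halves of the sorted batch. It gives 22.65, 16.15, and 27.65 where the solution prints 22.6, 16.1, and 27.6; the printed figures appear to be cut to one decimal.

```python
import statistics

# Infant mortality rates for the sixteen Eastern Montana counties (Part II).
rates = [28.4, 15.0, 14.2, 15.7, 26.3, 16.6, 17.2, 42.2,
         21.1, 40.8, 12.1, 19.7, 24.2, 26.2, 28.1, 27.2]

def five_number_summary(values):
    """Minimum, lower hinge, median, upper hinge, maximum of a batch."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2  # each half shares the middle value if n is odd
    return {
        "min": ordered[0],
        "lower_hinge": statistics.median(ordered[:half]),
        "median": statistics.median(ordered),
        "upper_hinge": statistics.median(ordered[-half:]),
        "max": ordered[-1],
    }

summary = five_number_summary(rates)
midspread = summary["upper_hinge"] - summary["lower_hinge"]
step = 1.5 * midspread
inner = (summary["lower_hinge"] - step, summary["upper_hinge"] + step)
# Values beyond the inner fences are "outside"; here the list comes out empty.
outside = [r for r in rates if r < inner[0] or r > inner[1]]
print(summary, round(midspread, 2), round(step, 2), outside)
```

The empty `outside` list agrees with the discussion: Roosevelt (42.2) and Rosebud (40.8) stand apart from the rest of the batch but still fall within the fences.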
Quantitative Methods for Public Management

Lecture 2-0.  Introduction to Unit 2


Introduction to Unit 2, Analysis of Multiple Batches of Data,
Non-Ordered


Lecture Content:


Introduction to the objectives, problem, and notation of
Unit 2


Main Topics:

1.  Specific Introduction to the Objectives of Unit 2

2.  Presentation of General Problem of Unit 2

3.  Notation for Unit 2
Topic 1.  Specific Introduction to the Objectives of Unit 2


I.  Questions to be answered in Unit 2

    1.  What is a non-ordered multiple batch?                    (1)

        a.  A collection of two or more batches related in
            some qualitative way

        b.  There is no quantifiable ordering of the batches     (2)
            in the collection

    2.  What analyses can be done on a collection of batches?

        a.  How can we best examine or contrast the batches?

            (Note: Since we are studying more than 1 batch,
            we can discuss comparison of batches)

        b.  What is, if one exists, the best unit of analysis
            for the examination?


II.  Skills to be mastered in Unit 2                             (3)

    1.  Perceiving and recognizing multiple batches that are
        non-ordered

    2.  Organizing the batches to facilitate comprehension,
        presentation, and analysis

    3.  Comparison of the batches in the collection

    4.  Transformations to stabilize variation across the
        batches
Topic 2.  Introduction to the Problems of Unit 2

I.  What is a non-ordered multiple batch?

    1.  Example: 1970 population of the 185 census tracts        (4)
        in Pittsburgh, and the 96 census tracts in Omaha,
        Nebraska

        a.  Relation: 1970 populations, by census tract

        b.  Qualitative aspect: 2 major SMSAs

    2.  Ordered batches are associated in a quantitative
        manner: we can measure the relationship between the
        batches in some unit.  These will be considered in
        future lectures.

II.  How can we compare the batches?                             (5)

    1.  Minimum values: Which batch has the smallest minimum?

    2.  Maximum values: Which batch has the largest maximum?

    3.  Median values: How do the typical values of the batches
        compare?

    4.  Spreads: Which batch has the smallest midspread?
        Which has the largest?

    5.  Shape: Are the batches symmetric?
        Do the batches have similar stem-and-leaf displays?

    6.  Units: Are the batches measured in the same units?

III.  Is there a better unit of analysis?

    1.  Are the extremes roughly equal?

    2.  Do the batches have similar ranges?


IV.  Examples

    1.  Population data                                          (6)

        a.  Minimums: 334 (Pittsburgh), 12 (Omaha, why so
            small?)

        b.  Maximums: 7910 (Pittsburgh), 12458 (Omaha)

        c.  Medians: 2602 (Pittsburgh), 3402 (Omaha)

        d.  Spreads (Midspreads): 2248 (Pittsburgh),
            3051 (Omaha)

        e.  Both batches measured in numbers of persons

        f.  Omaha batch has larger range than Pittsburgh

        g.  Cannot compare shape

    2.  Achievement Pretest Scores for incoming students,        (7)
        by undergraduate studies

        a.  Minimum (21) and Maximum (48) equal

        b.  Medians: 31.5 engineering and science,
            35 humanities and social science

        c.  Midspreads: 16 engineering, 13 humanities

        d.  Batches measured in number of correct answers

        e.  Batches appear quite similar

    3.  Life expectancies for various countries by 5 national    (8)
        groupings

            Industrial (20 countries)
            Petroleum Exporting Countries (9)
            High-income Countries (24)
            Middle-income Countries (19)
            Lower-income Countries (33)

        a.  Minimums, Maximums, Medians vary greatly

        b.  Midspreads vary roughly from 3 to 13 years

        c.  Batches appear quite dissimilar

V.  Conclusion

    1.  Need methods of comparing batches

    2.  Need methods of determining whether transformation
        is warranted
Topic 3.  Introduction to the Notation of Unit 2

I.  Conventions

    1.  Capital letter ("X") denotes entire data set             (9)

    2.  First subscript (X_i) denotes specific batch

    3.  Second subscript (X_ij) denotes specific element in
        a specific batch

II.  Example: Life expectancies

    1.  Let X = Life expectancies for countries

    2.  Let X_1 = Life expectancies for Industrial countries

            X_2 = Life expectancies for Petroleum exporting
                  countries
            . . .
            X_5 = Life expectancies for Lower-income countries

    3.  Let X_11 = Life expectancy for Australia
            X_12 = Life expectancy for Austria
            . . .
            X_5,33 = Life expectancy for Zaire

        Let X_ij = Life expectancy for country j in batch i




Lecture  2-0 
Transparency  Presentation  Guide 


Lecture Outline        Transparency
Location               Number          Transparency Description


Topic 1
  Section I
    1.a                1               Multiple Batch
    1.b                2               Non-ordered Batches
  Section II
    1.                 3               Topics for Unit 2

Topic 2
  Section I
    1.                 4               1970 populations of Pittsburgh
                                       & Omaha
  Section II
    1.                 5               Questions to be answered for
                                       Multiple Batches
  Section IV
    1.                 6               Populations, several values
                                       indicated
    2.                 7               Achievement Pretest Scores
    3.                 8               Life Expectancies for Various
                                       Countries

Topic 3
  Section II
    1.                 9               Notation




ItA^f)  ^  ii^«i/ji.€.  "mu/^/^le  feafcHcs 




g-XO^p>Us>  AUn.  Qc«j.c«-cd    »1u.fK|ate.  Bcc^dtCS; 


QMPM 


§72. 

20BS. 

1107. 

2471. 
765. 

5747. 

2155. 

1253 

2744. 

5269. 

5319. 

1521, 

1985. 

3156. 
4719. 

996  c 
2297c 
2679. 
3331 
1792. 
3103. 
1209 
3812. 
6242. 
3297  : 


4082, 

2919; 

2973 

1050. 

1359c 
728 

2776, 

2135 

4247. 

2316. 

3228: 

2979 

5435. 

4615. 

5203. 

3578. 

3509 

2607 

335: 

1227. 

2779. 

2932c 

2487. 

3853. 

2612, 

5818 

3413. 


1972. 

2424. 

3712. 

1145. 

2055. 

1205c 

2135. 

133^ 

1472. 

3092 

3769. 
5148. 
1212. 
3994. 

404 
2398. 
6796. 
2396. 
6235 
1159 

345, 
2125. 
1193 
3905 
1640 
6527, 
334. 


391. 

6017. 

2505. 

3629. 

1876. 

2382. 

1349. 

<730. 

3832 

4060. 

4058. 
3268. 
2041 . 
7910. 
5088. 
1424. 
5371. 
1870 

3121: 

2579 
4327. 
1056. 
3291 
2057 
3921. 
955c 


631, 
729. 

1919. 
453. 

2915 

3122. 

3628; 

2942. 

I452e 

3915. 
4114. 

3133. 
2068. 
3181. 
3962. 
3820; 
5638. 
2574. 

591: 

569. 
2987. 
2325  c 
1044. 
4437 
992, 
5300. 


735. 

3669 

3294. 

1645. 

1415, 

1019. 

4415. 

3469c 

3378. 

2602. 

1645. 

4520. 

1577. 
4392. 
3752. 
2il9c 
3765. 

603. 
1558. 

508. 
2254. 
1963 
4  561: 
1399 
1986. 
1209 


1938: 
2437. 

449 
2447: 
2386. 
1418 
3153, 
4095. 
1971, 

048 
5308. 
6003. 
2084 
4758, 
1884 
1418. 
7425 
2658 
1343 
442 
2569 
-719, 
3609 
2144 
3425 
2829 


S524. 
4004 

053. 
2648. 
3628. 
2954. 
1326 
5522. 
§414. 

3450: 

S401.: 
§054 

4109: 

135. 


12. 
1959c 
1212. 
2542 
5408. 
5501. 

1194: 

5859. 
5374. 
6139. 
12456, 
9926. 
3114 
4213- 


3254. 

2177. 

2755: 

3244. 

7501  c 

5476 

3248. 

5173 

5627. 

923 
5035. 
7644. 

992 
5000. 


3040. 

2538. 

1566. 

3312. 

4358 

3473. 

22ii. 

4079: 

5702r 

01 30  c 

2466 

5267. 

1725. 

7566 


232 


2298. 
2241  . 
1700 

3004; 

2713. 
5457c 
3912. 

3410: 

3471  . 
9366  e 
6733c 
038. 
3269, 
7356c 
O 


3573. 

1440. 

2400. 

2359. 

3110. 

2756c 

2219. 

3197. 

3054. 

0952 

4049 

1033, 

4347. 


3142 
720 
3357 
2540. 
4682 
2573 
2912. 
4379. 
5972 
7315, 

11874 
1528 




PittSb 

_^  - 

-  -a       -  1— 

U 



972. 

4082 . 

A  ?  z  ♦ 

391. 

631 . 

7  35 

1938 . 

Ii62. 

»  7  -i-  7  . 

«  4  «  4  . 

6687. 

729. 

3689 . 

2437 . 

208$. 

2973. 

3712. 

2505. 

1919. 

3294 . 

-  449 

4187. 

1050. 

1645. 

3629. 

-453. 

1645. 

2447, 

1867. 

1359. 

2855. 

1876. 

2915, 

140  5. 

2386. 

24  71 

728. 

1205. 

2382. 

3122. 

1019 

1410. 

7SS. 

2776 

2135, 

1349. 

3628, 

4415, 

3153. 

5747. 

2135 

1330 

4730. 

2942. 

3469. 

4095 

2155. 

4247. 

1472: 

3632. 

1452. 

3378. 

.  1971. 

1253. 

2316. 

3092. 

406fl. 

3945, 

2tt2^ni4 

»i|  846 

2744. 

3228. 

3769. 

4656. 

4014. 

1645. 

5300. 

5269 

29  79. 

5146. 

3268. 

3133. 

4520. 

6003, 

5319. 

5435. 

1212. 

2041  . 

2068. 

1577. 

20B4. 

1521 

4615 

3994. 

Mi^M^a  3186. 

4392. 

4758. 

1985. 

5203. 

484  , 

5880. 

3962. 

3752. 

1684  . 

3156. 

3578. 

2398 

1424 

3620. 

2019. 

1418. 

4719. 

3509. 

6796.. 

5371  . 

5630. 

3765. 

7425. 

996 

2607. 

2396. 

187P. 

2574. 

603. 

2658. 

2297. 

335. 

6235. 

3121. 

791  . 

1556. 

1343. 

2678. 

1227 

1159. 

2579 

569. 

588. 

442; 

3338 

2779. 

345. 

4327. 

2987 

2254. 

2569 

1792. 

2932. 

2125 

1056. 

2325. 

1963 

719: 

3103. 

2^87. 

1193. 

3291  c 

1044. 

4561. 

3609, 

1289 

3953. 

3905 

2857. 

4437. 

1399. 

2144 

3812 

2612. 

1640. 

3921  . 

992, 

1906. 

3425 

6242. 

58]8< 

6527. 

955. 

5300. 

1289. 

2829 

3297: 

3413. 

5524. 

1959? 

3040. 

2298 

3573. 

3142 

4004. 

2538. 

2241. 

1446 

720 

653 

12)2 

2755 

1566 

1700. 

2406. 

3357 

2648. 

25«? 

3244. 

1312. 

3W04 

2359 

2540 

3628 

5408. 

7581  . 

4350 

2703. 

3110. 

4602 

2954. 

5501  . 

5476 

3473  . 

5457 

2756. 

2573 

1326 

1894 

3248. 

2201  : 

3912.     -  1 
jjll  9iltd 

2269. 

2912 

5522. 

5659. 

5173 

4079. 

3197. 

4379, 

6414.  . 

Uil,  -1 

5374 

5627. 

5782. 

3854. 

5972, 

6139.  _ 

923. 

6130, 

9366! 

6952. 

7315 

5035 

2466. 

6731. 

4049 

7783, 

8B54 

7654. 

5267. 

838. 

1833. 

11874 

4189. 

3114 

992. 

1725; 

3269. 

4347. 

1528. 

135 

4213. 

">88B. 

7566 

7356. 



21  , 

21  . 

22 

23i 

24. 

24. 

25. 

25. 

2Si 

27.^1 

26. 

27. 

27. 

?8. 

27. 

29. 

28.  ^ 

39. 

30. 

33. 

33c 

33 

35. 

33- 

37. 

33^ 

41. 

37* 

<1 

37. 

W4  2 

49ll 

43- 

40* 

43; 

41 

46. 

42. 

47, 

48:M^X 

48. 



£1] 


Host  rai  l  1 
Austt  ia 
Be  I  q  i  ui' 
CahiiJi 

_inl  an*? 
Prance 

M.  Germany  . 
Ireland 
Italy 
Japan 

Nfth«r lands 
N«w  Z#alan(f 
Norway 
Pdrtuqil 
S.Afrlca 
Sweden  _ 
f>wi  tier  land 
Great  Iritaln 
United  Statcfi 


Petroleum 

71.1  Algeria 

7<*.5  rcuartbr 

i?.6  In'^onesia 

77  9  Iran 

71.  j  tr^q 

7?.1  Ni.jeria_ 

7P.J  fiaiHjIa  Piribia 

7?.»  v^n#rtuelo 

7n.7 

73.? 

73.8 

7ia 

74.1' 

158. ^• 

65  f 
71.7 

72.  : 
72.1 
71.: 


Hiaher  Incofff 

511,7   Argfrtiia  67.2 

52.1   Brazil  ^^^f 

47.5  Chiir  1 

<r,^   CblbrMa  45.  I 

•jI  6   Cor,t^  Pica  fi3.< 

D.   PriubHc  57,5 

17, H  Gfe^c**  'i^l 
47.3  Cuatcnala 

^6,4    Israel  ?1 

Jaffa  tea  64  . 7 

Lobahnh  _ 

Maiay^tia  ^^*^ 

•'•xico  61.4 

Nicaraqoa  49,9 

pariaiita  59.3 

Peru  _  54.1 

Singapore  67,6 

Spain  69, € 

Taiwan  68 :e 

Tr  in  i'lad  N . ; 

Tunisia  51  # 

tJrucjuay  68.6 

Ifunpsiavia  67 

2aRibia  43 .5 


Hlddle-lncowe 

iolivia 
Cawrroon 

Cbrqb 

n  Salvadoc 
Ghana 
HbnHuras 
Ivoty  Coast 
Jordan 
S.  fbrea 
Liberia 
Iflorocco 

Paoua  N.  Golnea 

Paraguay 

Philippines 

Syria 

Thailand 

Turkey 

S.  Vietnam 


49.1 
41. 
41. ff 
52. ■» 
5B.5 
37.1 

Ua 

38.0 
52.3 
61.9 
44.9 

5n.5 

46,8 
59.4 
51,1 
52.8 
56.2 
53.7 
50.8 


Uowcr- income 

Afghantctan 

nangl-^Hesh 

Rufwa 

nuruhHi 

Ca'nbofjia 

Alrica  Pep. 
Chad 

pabo^'^Y 
Et^  *ppia 

GO inea 

Haiti 

Ind  i  a 

Kenya 

tabs 

^|ai|a9^8car 

Malawi 

Halt 

Haur itahia 
Nepa  I 
Niger 
Pakistan 
Rwiinda 

Siena  leone 

Sbiiial  iB 

Sri_LanSa 

Taniania 

Togo 

Uganda 

U.  VPlta 

S,  Yeman 

iaveh 

Zaire 


i7, 
MA 
42.3 


H 
4) 
}4 
12 
37 
38 
27 
32 
41 
49.1 
47.5 


8 

8 

5 

r 

,3 
.5 
D 
.6 
.3 


36. 

19, 

37 

41 

40 

41 

51 

41 
41. 
38 
65, 
47< 
4f 
35 
47 
31 
42 
42 
18 


P 

% 
2 
8 
6 
f 
.3 
.8 
.8 
5 

.9 
.6 

,5 
.1 
«5 
•  6 
.3 
3 
8 




• 

-•- 

j^^^  s  i^^^i  •nf€€.^9ty  mf  U.S.  4.  «  F  f.  J 




Quantitative  Methods  for  Public  Management 

Lecture 2-1.  Comparison of Batches

Comparison  of  Batches:     The  use  of  Numeric  and  Graphic  Methods  for 

Comparison  of  Multiple  Batches  (1) 

Lecture  Content: 

1.  Discuss  extensions  of  Unit  1  tools  for  analyzing  two  or 
more  batches  simultaneously 

2.  Show  how  these  methods  convey  characteristics  of  the 
collection  of  batches 

Main Topics:

1.  Comparing  several  batches  of  data 

2.  Effectiveness  of  these  comparison  tools 

Tools  Introduced: 

1.  Parallel  Stem-and-Leaf  Display 

2.  Parallel  Schematic  Plot 
Topic 1.  Comparing Several Batches of Data

I.  Basic Issue: Comparison of data

    1.  We know how to organize and condense single batches
        effectively

    2.  Often interesting data sets contain qualitatively
        related multiple batches

    3.  Need techniques to examine them simultaneously

    4.  Need to organize the batches in a consistent,
        reliable, and effective manner to facilitate
        comparison and analysis

II.  Problem: Can the tools of Unit 1 be used to analyze two
     or more batches?

    1.  Develop simple rules for extending the elementary
        techniques of previous unit

    2.  First step in analysis should be organization of
        the batches

    3.  Organization should be followed by a condensation
        of information

    4.  Specific questions to be answered:

        a.  How do extremes of the batches compare?
        b.  Are the medians of the batches similar?
        c.  Are the midspreads of the batches equal?
        d.  How do the shapes of the batches compare?

    5.  Remember the batches must be non-ordered.  Ordered
        batches are discussed in Unit 3 where we concentrate
        on the relationship between the batches and the
        appropriate ordered scale

III.  Solution: organization and condensation tools computed
      in parallel

    1.  Parallel stem-and-leaf display
    2.  Parallel schematic plot

IV.  Methods

    1.  Parallel stem-and-leaf display: Organization tool

        a.  Example shows a parallel stem-and-leaf display       (2)
            of the 1970 populations of Pittsburgh and Omaha
            census tracts

        b.  Features

            i.  Simple idea
            ii.  Same features as with single batch:

                A.  "Face validity"

                B.  Retains information on individual data
                    values

                C.  Flexible
            iii.  Easy to construct

        c.  Analytic Qualities

            i.  Extremes easily located

            ii.  5-number summaries found using depths for
                 each batch

            iii.  Shapes of batches

        d.  Procedure

            i.  Choose a convenient unit, one for all batches
                together

            ii.  Separate every data value into a stem and a leaf

            iii.  Find smallest minimum and largest maximum for
                  the entire data set

            iv.  Write down the stems, one set for all batches

            v.  For each batch, place leaves on correct stem

            vi.  Batches are separated in the display, with
                 leaves placed in parallel groups
        e.  Example: 1970 populations for Pittsburgh and Omaha

            i.  Convenient unit for display is 100 persons

            ii.  Make separate stem-and-leafs and put them
                 side by side, with common set of stems          (2)

            iii.  Parallel display shows:

                A.  Difference in shape

                B.  Difference in spread

                C.  Omaha outliers (11874, 12458)

            iv.  Square root transformation improves symmetry
                 of both batches                                 (3)

        f.  Another example: Undergrad Cumulative Average for
            most incoming masters students by undergraduate background

            (Make parallel stem-and-leaf on board)
            (Unit = 1)

            (Note resemblance or lack thereof)

        g.  Parallel stem-and-leaf displays on computer:

            Use STEM once per batch, specify same UNIT, LPS,
            HICUT, LOCUT for each batch.  Paste stems together
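
The procedure for the parallel stem-and-leaf display can be sketched in code. This is not the course's STEM function; it is an illustrative stand-in with an invented name, assuming non-negative values and leaves truncated (not rounded) to one digit. The sample batches combine the extremes quoted in Lecture 2-0 with made-up values in between.

```python
from collections import defaultdict

def parallel_stem_and_leaf(batches, unit):
    """Return text lines of a parallel stem-and-leaf display.

    batches: dict mapping batch name -> list of non-negative values.
    unit: size of one leaf digit (100 means each leaf counts hundreds).
    """
    # One unit for all batches; split every value into a stem and a leaf.
    split = {}
    for name, values in batches.items():
        rows = defaultdict(list)
        for value in sorted(values):
            stem, leaf = divmod(int(value // unit), 10)
            rows[stem].append(str(leaf))
        split[name] = rows
    # One common set of stems, from the smallest minimum to the largest maximum.
    all_stems = [stem for rows in split.values() for stem in rows]
    stems = range(min(all_stems), max(all_stems) + 1)
    # Place each batch's leaves on the shared stems, in parallel columns.
    names = list(batches)
    width = {n: max(len(n), max(len(leaves) for leaves in split[n].values()))
             for n in names}
    lines = ["stem | " + " | ".join(n.ljust(width[n]) for n in names)]
    for stem in stems:
        cells = ["".join(split[n].get(stem, [])).ljust(width[n]) for n in names]
        lines.append(f"{stem:>4} | " + " | ".join(cells))
    return lines

# Illustrative values only: quoted extremes plus invented values in between.
sample = {
    "Pittsburgh": [334, 972, 2602, 4082, 7910],
    "Omaha": [12, 3402, 5408, 11874, 12458],
}
display = parallel_stem_and_leaf(sample, unit=100)
print("\n".join(display))
```

Because both batches share one column of stems, the long upper tail of the Omaha batch shows up as leaves on stems where the Pittsburgh column is empty.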

    2.  Parallel Schematic Plot: Graphical Condensation

        a.  Example: 1970 populations for Pittsburgh and
            Omaha                                                (4)

        b.  Features

            i.  Useful in discussing appearance of batches

            ii.  Adequate comparison tool for nearly all
                 collections

            iii.  Computable from parallel stem-and-leaf

        c.  Analytic qualities

            i.  Made on ordinary graph paper

            ii.  y-axis is common scale for all values in all
                 batches

            iii.  Extremes, hinges, and medians clearly marked

        d.  Procedure

            i.  Determine smallest and largest values in data
                set to make scale

            ii.  Compute 5-number summaries

            iii.  Draw simple schematic plots, one per batch,
                  in parallel, using common scale

        e.  Another example: Life expectancies for countries,    (5)
            classified as to their "wealth"                      (6)

            i.  Schematic shows differences in spread and        (7)
                location

            ii.  Petroleum similar to middle income, but not
                 in midspread

            iii.  Downward trend evident

        f.  Schematic plots, in parallel, on computer:

            Use function BOX with all data files as arguments
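
The three procedure steps for the parallel schematic plot can be sketched as follows. This is not the BOX function mentioned above; it is a hypothetical text rendering that runs horizontally rather than on a vertical y-axis, and the life expectancy values are invented for illustration. It assumes the overall maximum exceeds the overall minimum.

```python
import statistics

def five_numbers(values):
    """(minimum, lower hinge, median, upper hinge, maximum) of one batch."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2
    return (ordered[0], statistics.median(ordered[:half]),
            statistics.median(ordered), statistics.median(ordered[-half:]),
            ordered[-1])

def parallel_schematic(batches, width=50):
    """Return (summaries, text lines) with every batch drawn on one common scale."""
    summaries = {name: five_numbers(values) for name, values in batches.items()}
    # Common scale from the smallest minimum to the largest maximum.
    lo = min(s[0] for s in summaries.values())
    hi = max(s[4] for s in summaries.values())
    scale = (width - 1) / (hi - lo)

    def col(x):
        return round((x - lo) * scale)

    lines = []
    for name, (mn, lh, med, uh, mx) in summaries.items():
        row = [" "] * width
        for c in range(col(mn), col(mx) + 1):
            row[c] = "-"          # extreme to extreme
        for c in range(col(lh), col(uh) + 1):
            row[c] = "="          # hinge to hinge (the box)
        row[col(med)] = "M"       # median
        lines.append(f"{name:>12} |{''.join(row)}|")
    return summaries, lines

# Hypothetical life expectancies (years); not the transparency data.
groups = {
    "Industrial": [65.0, 70.5, 71.1, 72.0, 73.2, 74.1],
    "Lower-income": [27.0, 32.0, 37.5, 41.0, 42.3, 49.1, 65.0],
}
summaries, plot = parallel_schematic(groups)
print("\n".join(plot))
```

Drawing every batch against the same scale is what makes differences in location and spread visible at a glance.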
Topic 2.  Effectiveness of these Comparison Tools

I.  Basic Issue: once condensed into parallel schematic plot,
    how much can we learn about the batches?

II.  Try to answer:

    1.  Are there any outliers in the data set?

    2.  How do the batches compare with respect to shape?

    3.  Is there any obvious relation among the medians or
        midspreads?

III.  Methods

    Parallel stem-and-leaf displays and schematic plots answer
    these questions

    (Present several other examples of unordered multiple
    batches and discuss appearance of each)
Quantitative Methods for Public Management

Lecture 2-2.  Transformation for Stabilization of Spread

Transformations for Stabilization of Spread: The Use of Various
Algebraic Transformations to Equalize Spread Among Batches

Lecture Content:

1.  Discuss need for transformation

2.  Introduce method of determining a good transformation

Main Topics:

1.  Necessity of transforming a multiple batch

2.  Use of medians and midspreads in finding a good transformation

Tools Introduced:

1.  Median/Midspread Plot
Topic 1.  Necessity of Transforming a Multiple Batch

I.  Basic Issue: Comparison of batches is difficult if batches
    differ greatly in spread

    1.  We know transformations are helpful in changing the
        shape of single batches

    2.  When more than 1 batch is being analyzed, comparisons
        are easier if batches are similar in spread

        a.  Example: Parallel Schematic Plot of life expectancies    (2)
            for nations

            i.  Difference in spreads in the 5 batches makes
                conclusions concerning location difficult

            ii.  Spreads are roughly equal, except for
                 industrialized nations

        b.  Example: Parallel Schematic of Infant Mortality
            for nations

            i.  Locations similar, spreads vary enormously

            ii.  If we balance the spread, will locations still
                 be similar?

        c.  Example: Parallel Schematic of Per Capita Income
            for nations                                          (4)

            Note relationship between location and spread

    3.  If comparisons of location are to be made, task is
        easier if spreads are equalized

    4.  We transform batches to equalize or "balance" the spread

    5.  If comparisons of spread are to be made, transformation
        is unnecessary; merely "line up" plots so that medians
        are equal, and compare spreads

    6.  In conclusion, how much of the difference in location
        is due solely to location, and how much is due just to
        difference in spread?
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II.     Problem:     Want  our  schematic  plots  to  tell  their  story  as 
clearly  and  simply  as  they  can 

1.  Symmetry of spread within batches is helpful for
summarizing single batches

2.  Balance of spread between batches is essential for
comparisons


III.    Solution:     Choose  transformation  to  achieve  equalization 
in  spread 

1.  The  transformation  will  usually  promote  symmetry  within 
batches 

2.  As with transformations for symmetry, the search for a
good transformation is exploratory, and even the best
transformation may fail to equalize spread completely.
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Topic  2,    Use  of  Medians  and  Midspreads  in  Finding  a  Good  Transformation 


I.    Basic Issue:  How do we find the best transformation?

1.  We  understand  that  transformation  may  be  essential  in 
comparing  batches 

2.  Since  transformation  affects  the  relationship  between 
the  medians  and  the  midspreads  of  the  batches,  how  do 
we  use  these  values  to  find  the  best  transformation  ? 

II.   Problem:  How do we let the medians and midspreads tell us
the correct transformation?


1.  We are searching for a consistent relation between
medians and midspreads

2.  The best way to study the relationship of the medians
and midspreads is with a scatterplot

3.  Could line up schematics, but a scatterplot is clearer

4.  Best to look at a scatterplot of log(Median) versus
log(Midspread), one ordered pair per batch

5.  A linear scatter with nonzero slope implies transformation
is necessary

III.  Solution:  Examine slope of the log(Median) vs. log(Midspread)
scatterplot

1.  Suppose scatterplot was close to linear with an "eyeball"
slope of p

2.  Correct exponent for the transformation Y = X^r is r = (1-p)

3.  Slope tells how far down the "ladder of powers" to move

4.  Slope of:    1  = logs (i.e., 1-p = 0)
                 2  = negative reciprocals (1-p = -1)
                1/2 = square roots (1-p = 1/2)
                 0  = no transformation (1-p = 1)
                -1  = squares (1-p = 2)
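
The rule r = 1 - p can be motivated by a short approximation. The sketch below is not part of the original outline; it assumes that the midspread is roughly proportional to a power of the median (which is what a linear log-log plot with slope p means) and that each batch is narrow enough for a first-order approximation to hold. The constants c and k are illustrative symbols.

```latex
\begin{align*}
\log(\text{midspread}_X) &\approx c + p\,\log(\text{median}_X)
  \;\Longrightarrow\; \text{midspread}_X \approx k\,(\text{median}_X)^{p}, \\
\text{for } Y = X^{r}:\quad
\text{midspread}_Y &\approx \left|\frac{dY}{dX}\right|_{X=\text{median}_X}
  \cdot \text{midspread}_X \\
  &= |r|\,(\text{median}_X)^{r-1}\cdot k\,(\text{median}_X)^{p}
   = |r|\,k\,(\text{median}_X)^{\,r+p-1}.
\end{align*}
```

Under these assumptions the transformed midspread no longer depends on the median when r + p - 1 = 0, that is, r = 1 - p. For p = 1 the same argument with Y = log X (derivative 1/X) gives a constant midspread, which is why logs occupy the r = 0 position on the ladder.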

IV.   Method:  Log(Median) / Log(Midspread) Plot

1.  Example:  Log median / Log midspread plot for per capita
incomes of countries
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2»  Features 

a.  Used in determining a good transformation to compare
batches

b.  Scatterplot made from 5-number summaries of the
batches

c.  Relationship of log medians and log midspreads
determines the exponent, r, for the transformation

3.  Analytic Qualities

a.  Slope of the plot determines r

b.  Relationship between r and slope, p, is r = (1-p)

c.  Random scatter (p=0) implies no transformation
necessary

d.  Plot made on ordinary graph paper

e.  If collection has fewer than 4 batches, may not have
enough points to determine slope

f.  Log(median) vs. Log(midextreme) (or other measures
of spread) plot may be used to determine transformation

g.  For "well-behaved" batches, log(mean) vs. log(standard
deviation) is acceptable; note that standardizing
obscures differences in level

4.  Procedure

a.  Compute 5-number summaries for the batches and find
midspreads

b.  Compute the logarithms of the medians and midspreads

c.  On a piece of ordinary graph paper, plot log(median)
as X and log(midspread) as Y, one point for each
batch; or use log-log paper and plot median vs.
midspread directly (6)

d.  Find a slope by choosing two representative points,
one at left end of scatter (X1, Y1) and one at
right end of scatter (X2, Y2).  Slope = (Y2 - Y1)/(X2 - X1)

e.  If preferred, an "eyeball slope" may be used: a
slope fit to the data by eye

f.  Correct exponent of transformation, r, is (1-p)

g.  With new exponent, find transformation of 5-number
summaries

h.  Make new schematic plot of transformed data to
compare batches (7)

5.  Another example:  Percentage of Individual Tax Returns
audited in Fiscal 1974, by state

a.  4 regions in the U.S. (8)

b.  Parallel stem-and-leaf shows slight difference in
spread

c.  Log(median) vs. log(midspread) plot indicates
r = -5 2/3, a strange transformation

d.  Best left in original unit

6.  Another example:  Percentage of population illiterate
in 1960, by state

a.  Same 4 regions used

b.  Parallel displays show differences in both location
and spread

c.  Plot has slope of -0.60.  1 - (-0.60) = 1.6, about
2; try squares

d.  Schematic plot of transformed data shows equalization of
spread (except for Atlantic) (10)


7.  Another example:  Percentage of population illiterate
in 1900, by state (11)

a.  Batches differ greatly in spread

b.  Unable to determine slope; is it 1 or 3?

c.  Try both -1/(X^2) and logs

d.  Negative reciprocal squares fail miserably (12)

e.  Logs quite good (13)

8.  Log Median/Log Midspread plots constructed on computer:

a.  Use SUMMARY to obtain medians and midspreads

b.  Input these into 2 separate files

c.  Take logs with REEX

d.  Plot with PLOT
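
The procedure of Topic 2 (5-number summaries, logs, two-point slope, exponent r = 1 - p) can be sketched in a modern language. This is only an illustration, not the course's SUMMARY/REEX/PLOT programs: the function names are hypothetical, the hinges follow the depth rule used in the letter value displays of this unit, and the three batches are made-up data whose spread is proportional to level. It assumes all medians and midspreads are positive and that the batches do not all share one median.

```python
import math


def value_at_depth(sorted_vals, depth):
    """Value at a (possibly half-integer) depth counted from the start of the list."""
    lo = math.floor(depth)
    hi = math.ceil(depth)
    return (sorted_vals[lo - 1] + sorted_vals[hi - 1]) / 2


def median_and_midspread(batch):
    """Return (median, midspread), with hinges taken at depth (floor(median depth) + 1) / 2."""
    low_to_high = sorted(batch)
    high_to_low = low_to_high[::-1]
    n = len(low_to_high)
    median_depth = (n + 1) / 2
    hinge_depth = (math.floor(median_depth) + 1) / 2
    median = value_at_depth(low_to_high, median_depth)
    lower_hinge = value_at_depth(low_to_high, hinge_depth)
    upper_hinge = value_at_depth(high_to_low, hinge_depth)
    return median, upper_hinge - lower_hinge


def two_point_slope(batches):
    """Slope p through the leftmost and rightmost (log median, log midspread) points."""
    points = sorted(
        (math.log10(med), math.log10(mid))
        for med, mid in (median_and_midspread(b) for b in batches)
    )
    (x1, y1), (x2, y2) = points[0], points[-1]
    return (y2 - y1) / (x2 - x1)


def suggested_exponent(batches):
    """Exponent r = 1 - p for the transformation Y = X^r (r = 0 means logs)."""
    return 1 - two_point_slope(batches)


# Hypothetical batches: the same shape at three levels, so spread grows with level.
base = [1, 2, 3, 4, 5, 6, 7, 8, 9]
batches = [[v * scale for v in base] for scale in (1, 10, 100)]
print("slope p =", round(two_point_slope(batches), 2))
print("exponent r =", round(suggested_exponent(batches), 2))
```

For these made-up batches the midspread is exactly proportional to the median, so the slope is essentially 1 and the suggested exponent is essentially 0, the "logs" rung of the ladder. With real data the two end points should be chosen as representative points, as step 4.d describes, rather than taken mechanically.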
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Lecture 
2-2 Transparency Presentation Guide

Outline Location      Transparency   Transparency Description

Beginning                  1         Lecture 2-2 Outline

Topic 1, Section I
  2.a                      2         Parallel Schematic Plot of Life Expectancies
  2.b                      3         Parallel Schematic Plot of Infant Mortality
  2.c                      4         Parallel Schematic Plot of Per capita Incomes

Topic 2, Section IV
  4.a                      5         5-number summaries of Per capita Incomes
  4.c                      6         Log Median vs Log Midspread and Eyeball Slope
                                     for Per capita Incomes
  4.h                      7         Parallel Schematic Plot of Logs of Per capita Incomes
  5.a                      8         Percentages of Individual Tax Returns Audited
                                     by State in 1974
  6                        9         Percent Illiterate by State, 1960
  6.d                     10         Parallel Schematic Plot of Squares of % Illiterate, 1960
  7                       11         Percent Illiterate by State, 1900
  7.d                     12         Parallel Schematic Plot of Negative Reciprocals
  7.e                     13         Parallel Schematic Plot of Logs
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Homework 
Unit 2

1.  These data are from a study designed to determine whether
varying the report of the results of a controversial psychological
study can influence judgements about the ethics of the research.

Three groups of subjects (high school teachers) read
summaries of the Milgram (1963) obedience study [Milgram, S.,
"Behavioral Study of Obedience," Journal of Abnormal and Social
Psychology, Vol. 67, 371-378].  These summaries were identical except
for the reporting of the results.  One group of teachers read
the actual results of the Milgram study (Actual Results group).
One group read that nearly all of Milgram's subjects delivered the
highest shock available to the confederate (Many Comply group).
A third group read that nearly all of Milgram's subjects refused
to deliver the highest shock to the confederate (Many Refuse group).

After reading the report, the teachers answered a number of
questions.  Among these questions, there was a seven point scale
on which the teachers were asked to rate the ethics of the study.
(The higher the rating, the more ethical the study was believed
to be.)

Compare the three groups and summarize the differences among
them.

THE DATA:

Actual Results:  6, 1, 7, 2, 7, 1, 7, 3, 4, 1, 1, 1, 6
Many Comply:     3, 1, 3, 7, 6, 7, 4, 3, 1, 1, 2, 5, 5
Many Refuse:     5, 7, 7, 7, 6, 6, 6, 6, 7, 2, 6, 3, 6

2.  An experiment on nipples for baby bottles compared
different nipple designs: the conventional one having a medium
circular hole and a new one having a terminal slot .11 inches
long.  A special bottle permitted an unrestricted flow of milk,
and the new nipple was positioned horizontally and vertically to
determine the effect of orientation.  For 24 babies, the volume
(in milliliters per suck) was as follows:


Medium        Slot          Slot
Hole          Vertical      Horizontal

0.81          1.33          0.92
0.50          2.10          0.78
0.78          1.50          1.20
0.43          1.60          1.00
0.50          1.70          [illegible]
0.71          2.00          NA
0.71          1.21          0.80
0.34          1.35          [missing]

(a)  Compare these batches with parallel schematic plots.

(b)  Transform the batches to stabilize the spread of the values.


3.  Suppose we collect measurements of FEV (Forced Expiratory
Volume) from individuals who work at the same factory and
are of the same age, sex, and height.  FEV is a measure of
pulmonary function.  We subdivide these individuals by
smoking status into the groups:  A = never a smoker; B = ex-smoker;
C = present smoker, currently smoking less than two packs per
day; D = present smoker, currently smoking at least two packs
per day.

The  data  are  as  follows: 


A:  260,  275,  260,  290 

B:  232,  230,  246,  245 

C:  224,  202,  262,  225 

D:  180,  195,  202,  175 


Compare the groups.

4.  When the trial of Dr. Benjamin Spock and his associates
began in 1968, the defense challenged the list of prospective
jurors because only 9% were women.  A more detailed examination
of jury venires in the U.S. District Court for the District
of Massachusetts revealed that in venires summoned for trials
before the six colleagues of the trial judge between 4 April
1966 and 22 October 1968, the percentages of women were:

Judge A:  40, 30, 16, 35, 50

Judge B:  36, 32, 32, 27, 29, 45

Judge C:  34, 30, 32, 29, 24, 28, 20, 35

Judge D:  24, 30

Judge E:  33, 36, 28, 20, 18, 22, 40

Judge F:  22, 21, 31, 27, 17, 29, 26, 29, 34


while those for the trial judge were:

Trial Judge:  16, 18, 14, 6, 18, 15, 9, 24

(a)  Compare these batches of percentages both numerically
and graphically.

(b)  Combine Judges A-F into one batch and compare the
trial judge with it.

(c)  Which comparison is most effective?  Why?


5.  Four groups of students were subjected to different teaching
techniques and tested at the end of a specified period of
time.  Their scores are shown below:

          Techniques
   1       2       3       4

  59      65      75      9[?]
  78      87      69      89
  67      73      83      80
  62      79      81      88

Compare batches (transformation is unnecessary) to determine
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6. 


The stem-and-leaf display below gives the percentage of
families in each Manhattan police precinct where combined
income in 1970 was less than $4,000.  (Data from New York
Times, March 30, 1973)

(a)  Write down the five number summary for these data, and
calculate

S = 3/4 * Midspread

(b)  What evidence (if any) is there in your answer to (a)
that this batch could be made more symmetric by
transformation?

(c)  The lower hinge, median, and upper hinge for precincts
in the Bronx, Brooklyn, Queens, and Staten Island are
given below.  Combine these data with that from
Manhattan to find a transformation that would equalize
the variability in the five batches.


Percent Families with Income < $4,000

Manhattan
Unit = 10%

0 | 5 6 7 9
1 | 0 2 2 4 5 5 6 7 9
2 | 0 3 6 7 7
3 | 1 5

                 Lower                 Upper
                 Hinge      Median     Hinge

Bronx            11-1/2       17        30
Brooklyn         12-1/2       19        24-1/2
Queens            6            9        11
Staten Island     5            6         8
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Solutions 


1.  Letter Value Displays

          Actual Results (A)    Many Comply (B)    Many Refuse (C)

#13
M7               3                     3                  6
H4           1       6             2       5          6       7
E            1       7             1       7          2       7

Schematic Plots
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For  the  group  that  read  the  actual  results,  there  were  a  wide 
range  of  opinions;  some  thought  it  was  ethical,  and  others  thought 
that  it  was  not . 

For  those  who  were  told  that  many  complied,  opinions  still 
were  split,  but  more  people  rated  the  experiment  with  middle  values 
(3's,  4's,  and  5's)  indicating  that  they  questioned  or  were  uncertain 
about  the  ethics  of  the  experiment. 

The most interesting result was for the group that was told that
most people refused to administer the shock.  Almost all of this
group felt that the experiment was very ethical.

As long as participants refuse to administer a shock, the
teachers felt the experiment was ethical; but when some were told
that shocks were administered, they began to question and disapprove
of the experiment.
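
The letter values behind this comparison can be checked with a short script. The sketch below is an illustration only: the function name is hypothetical, the hinge depths follow the #13 / M7 / H4 convention of the displays above, and the Many Comply entry that is printed unclearly in the data listing is assumed to be 1.

```python
import math


def five_number_summary(batch):
    """Return (low extreme, lower hinge, median, upper hinge, high extreme)."""
    up = sorted(batch)
    down = up[::-1]
    n = len(up)
    median_depth = (n + 1) / 2
    hinge_depth = (math.floor(median_depth) + 1) / 2

    def at_depth(values, depth):
        # Average the two neighbours when the depth ends in one half.
        return (values[math.floor(depth) - 1] + values[math.ceil(depth) - 1]) / 2

    return (
        up[0],
        at_depth(up, hinge_depth),
        at_depth(up, median_depth),
        at_depth(down, hinge_depth),
        up[-1],
    )


ratings = {
    "Actual Results": [6, 1, 7, 2, 7, 1, 7, 3, 4, 1, 1, 1, 6],
    "Many Comply": [3, 1, 3, 7, 6, 7, 4, 3, 1, 1, 2, 5, 5],  # second value assumed to be 1
    "Many Refuse": [5, 7, 7, 7, 6, 6, 6, 6, 7, 2, 6, 3, 6],
}

summaries = {group: five_number_summary(vals) for group, vals in ratings.items()}
for group, (low, lower_hinge, median, upper_hinge, high) in summaries.items():
    print(
        f"{group:15s} median={median:g} hinges=({lower_hinge:g}, {upper_hinge:g}) "
        f"extremes=({low}, {high}) midspread={upper_hinge - lower_hinge:g}"
    )
```

The Many Refuse group has a median of 6 and a midspread of only 1, while the other two groups share a median of 3 with much wider midspreads, which is the pattern the written solution describes.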
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2.    Letter  Value  Displays 
Medium Hole  (#8)
M4h         .61
H2h     .47     .75      midspread .28
E       .34     .81

Vertical Slot  (#8)
M4h         1.6
H2h     1.3     1.9      midspread .6
E       1.2     2.1

Horizontal Slot  (#7)
M4          .88
H2h     .79     .96      midspread .17
E       .65     1.2


A)  Parallel Schematic Plots
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B.    Mid  Summaries 

              Medium Hole    Vertical Slot    Horizontal Slot

Median            .61             1.6              .88

Midhinge          .61             1.6              .875

Midextreme        .57             1.65             .925


From the parallel schematic plots and the midsummaries we
can see that no transformation is called for.

Clearly, the nipples with a terminal slot are better, with
a vertical orientation being the best.


3.  Letter Value Displays

A  (#4)
M2h         268
H1h     260     275
E       260     290

B  (#4)
M2h         239
H1h     231     246      midspread 15
E       230     246

C  (#4)
M2h         225
H1h     213     244      midspread 31
E       202     262

D  (#4)
M2h         188
H1h     178     199      midspread 21
E       175     202

Parallel Schematic Plots

There is a definite difference between the non-smokers and the
two-pack smokers.  The more one smokes, the lower the F.E.V.  However,
before you could say much about how much of a difference there is, more
data should be obtained.  Four observations is not enough.

4.  Sorted batches

A:      16  30  35  40  50
B:      27  29  32  32  36  45
C:      20  24  28  29  30  32  34  35
D:      24  30
E:      18  20  22  28  33  36  40
F:      17  21  22  26  27  29  29  31  34
Trial:   6   9  14  15  16  18  18  24
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Numerical  comparison 


Judge     median     min     max     range

A          35         16      50      34
B          32         27      45      18
C          29.5       20      35      15
D          27         24      30       6
E          28         18      40      22
F          27         17      34      17
T          15.5        6      24      18

Stem-and-Leaf Display:  Judges A-F combined

1* |
1- | 678
2* | 0012244
2- | 677889999
3* | 0001222344
3- | 5566
4* | 00
4- | 5
5* | 0

unit = 1
N = 37

M  18          29
H   9      24      34
E   1      16      50

outside value - 50

Stem-and-Leaf Display:  all judges, A-F and trial judge

0* |
0- | 69
1* | 4
1- | 5667888
2* | 00122444
2- | 677889999
3* | 0001222344
3- | 5566
4* | 00
4- | 5
5* | 0

n = 45

median   23          28
hinge    12      20      32
extreme   1       6      50

While the individual comparison of each judge with the trial judge
shows that the trial judge's typical percentage of women was lower
than the other judges', I think the larger group comparison is more
valid.  This is because of the total numbers involved.

The comparison of the two boxplots clearly shows that about 75%
of the trial judge's venires had a lower percentage of women than
the combined group of judges.  It also shows that even when the
greatest percentage of women were in the trial judge's venire, 75
percent of the combined grouping had more women.

Both comparisons raise questions concerning how juries are
chosen, since most of the venires had less than ^0% women.

5.

6.
#20


M  10h         15.5
H   5h     11      24.5
E   1       5      35

(a)  S = 3/4 (24.5 - 11) = 3/4 (13.5) = 10.125

              Bronx    Brooklyn    Queens    Staten Is.    Manhattan

UH            30       24h         11         8            24.5
Median        17       19           9         6            15.5
LH            11h      12h          6         5            11
Midspread     18.5     12           5         3            13.5
Midhinge      20.75    18.5         8.5       6.5          17.75

(b)  The midsummaries differ:  midhinge = 17.75, midextreme = 20,
median = 15.5, mean = 17.3.

The batch doesn't trail off at both extremes, so it might be made
more symmetric; however, the ratio of maximum to minimum value
is less than 20, which would seem to indicate that transformation
might not help.  Also, though there were differences in
the different measures of typical value, they are not very
large differences.

(c)  Transformation to equalize variability in the batches:  negative
reciprocal square root of x.


              Bronx    Brooklyn    Queens    Staten Is.    Manhattan

UH            -.18     -.20        -.30      -.35          -.20
Median        -.24     -.23        -.33      -.41          -.25
LH            -.29     -.28        -.41      -.45          -.30
Midspread      .11      .08         .11       .10           .10
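
The effect of the negative reciprocal square root on these five batches can be checked numerically. The sketch below is an illustration, not part of the original solution; the names are hypothetical. It transforms the hinges directly, which is reasonable because the transformation is increasing (exact when a hinge is a single observation, approximate when it is an average of two).

```python
import math

# (lower hinge, median, upper hinge) in percent, as given in the problem and solution.
summaries = {
    "Bronx": (11.5, 17, 30),
    "Brooklyn": (12.5, 19, 24.5),
    "Queens": (6, 9, 11),
    "Staten Island": (5, 6, 8),
    "Manhattan": (11, 15.5, 24.5),
}


def neg_recip_sqrt(x):
    """Negative reciprocal square root; the minus sign keeps the original ordering."""
    return -1 / math.sqrt(x)


def spread_ratio(midspreads):
    """Largest midspread divided by smallest; 1.0 would mean perfectly equal spreads."""
    return max(midspreads.values()) / min(midspreads.values())


raw_midspreads = {
    name: upper - lower for name, (lower, _, upper) in summaries.items()
}
new_midspreads = {
    name: neg_recip_sqrt(upper) - neg_recip_sqrt(lower)
    for name, (lower, _, upper) in summaries.items()
}

print("raw midspread ratio:        ", round(spread_ratio(raw_midspreads), 2))
print("transformed midspread ratio:", round(spread_ratio(new_midspreads), 2))
```

The raw midspreads range from 3 to 18.5, a ratio of about 6; after the transformation the largest midspread is less than one and a half times the smallest, which matches the nearly equal midspreads in the table above. Small differences from the printed table (for example Staten Island) arise because the table subtracts hinges that were already rounded to two decimals.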
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QuiZi  Unit  2 
WRITE ALL ANSWERS ON A CLEAN SHEET OF PAPER
Part I.  Answer the following questions briefly and generally:

1.  What is an unordered multiple batch?

2.  How do we best compare a collection of related single batches?

3.  Why would we consider a transformation of a multiple batch?

4.  How do we determine the "best" transformation for a multiple
batch?

5.  If a multiple batch consisted of 2 well-behaved batches, and
it was determined that a transformation was necessary, what
statistics of the batches would we use to find the "best"
transformation?

Part II.

1.  Given below is a data set of median annual incomes of
individuals with doctorates employed in education (academia),
government, and industry in 1964.

Area of
Doctorate          Education    Government    Industry

Agriculture         $11,100      $11,500      $12,000
Biology              10,500       11,900       14,000
Earth Sciences        9,900       11,700       13,500
Mathematics          10,300       15,100       17,000
Chemistry            10,000       12,700       14,000
Physics              11,000       13,800       16,000
Psychology           10,000       11,500       15,900

Using the information on the batches given below, determine the
best transformation for the batches.  You need not carry out the
transformation.

                   Education    Government    Industry

E                   $ 9,900      $11,500      $12,000
H                    10,000       11,600       13,750
M                    10,300       11,900       14,000
H                    10,750       13,200       15,950
E                    11,100       15,100       17,000

midspread               750        1,600        2,200

log H                  4.00         4.06         4.14
log M                  4.01         4.08         4.15
log H                  4.03         4.12         4.20

log midspread          2.88         3.20         3.34

2.    On  the  next  page    is  a  detail,  or  small  section  of,  a  display 
given  in  the  book  Profiles  in  School  Support,  1969-1970. 

A.  Briefly  discuss  the  "analytic"  features  of  the  display: 
what  kind  of  display  is  it,  what  do  the  various  lines  of 
each  box  mean,  etc.,  as  explained  in  the  aforementioned 
book. 

B.  Compare  the  4  states  among  themselves. 

C.  Compare  each  state  separately  with  the  United  States. 
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Unit  2  Quiz 
Solutions 

PART  ONE 

1.  An unordered multiple batch is a set of batches which have
each been collected in a consistent manner, containing
similar values having a non-quantitative relationship to
one another.

2.  We can best compare related single batches through the use
of parallel stem-and-leaf diagrams and parallel schematic
plots.  We can also use the five number summaries.  But we
must be cautious to control spread via a transformation if
necessary.

3.  Transformation in a multiple batch is used to equalize the
spread and remove a possible consistent relationship between
spread and typical value in the batches.

4.  The method for finding the "best" transformation consists
of taking the logarithm of the median and midspread (sometimes
the distance between the extremes) of each batch and then
plotting these as points in an x-y (log median, log midspread)
plane.  A line is drawn to approximately fit these points with
slope = p.  The slope is subtracted from 1, and the transformation
is found by raising (or lowering) the original
data by a power equal to that difference,

i.e. X^R where R = 1 - p

5.  The mean and the standard deviation


PART TWO

1.  The slopes of possible lines to fit those points vary from
2 to 3.  This indicates that transformations could
range on the ladder of powers from R = -1 to R = -2 (negative
reciprocals or negative reciprocals of the square).
Take (x1, y1) = (4.01, 2.88); (x2, y2) = (4.15, 3.34)

slope = (3.34 - 2.88)/(4.15 - 4.01) = .46/.14 = approximately 3

R = 1-p = 1-3 = -2

[Plot: log midspread (vertical axis) against log median (horizontal axis)]

transformation:  negative reciprocal of the square
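
The arithmetic of this solution can be reproduced in a few lines. The sketch below is illustrative only; the function and variable names are hypothetical. It uses the rounded logs and the hinges from the quiz table, takes the slope through the two end points, rounds it to the nearest rung of the ladder, and then checks how well the resulting power equalizes the three midspreads.

```python
import math

# Rounded values from the quiz table: (log median, log midspread) per sector.
log_points = {
    "Education": (4.01, 2.88),
    "Government": (4.08, 3.20),
    "Industry": (4.15, 3.34),
}
# (lower hinge, upper hinge) in dollars, from the same table.
hinges = {
    "Education": (10000, 10750),
    "Government": (11600, 13200),
    "Industry": (13750, 15950),
}


def ladder_transform(x, r):
    """Power transformation Y = X^r that keeps order: logs for r = 0, -X^r for r < 0."""
    if r == 0:
        return math.log10(x)
    return x ** r if r > 0 else -(x ** r)


(x1, y1), (x2, y2) = log_points["Education"], log_points["Industry"]
slope = (y2 - y1) / (x2 - x1)   # .46 / .14, a little over 3
exponent = 1 - round(slope)     # nearest ladder rung: 1 - 3 = -2

raw = {k: hi - lo for k, (lo, hi) in hinges.items()}
new = {
    k: ladder_transform(hi, exponent) - ladder_transform(lo, exponent)
    for k, (lo, hi) in hinges.items()
}
raw_ratio = max(raw.values()) / min(raw.values())
new_ratio = max(new.values()) / min(new.values())
print("slope:", round(slope, 2), "exponent:", exponent)
print("midspread ratio before:", round(raw_ratio, 2), "after:", round(new_ratio, 2))
```

Before the transformation the largest midspread is almost three times the smallest; after taking negative reciprocals of the squares the ratio falls to roughly 1.3, so the spreads are much closer to balanced, though not perfectly equal.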


2.  A.  These styles represent boxplots, but they are not exactly
as we have defined them.  The solid line extends from the 2nd
percentile to the 98th percentile.  The outer edges of the
boxes represent the level of expenditures for the 25th and
75th percentiles.  The middle line in the box is the level
of expenditures at the 50th percentile.


Percentiles involve dividing a distribution into 100
sections with an equal number of observations in each section.
Therefore, the 50th percentile is very similar to a median
in that half of the observations are found on either side of
it.  So the internal box is like a boxplot but the whiskers
are not.

B.  No consistent increase in spread with typical value.
Distributions quite asymmetric.  Nonetheless, there seems to be a
clear trend of increasing typical value.

C.  New York, New Jersey and Connecticut all have greater
expenditures in 75% of their classrooms than the national
50th percentile expenditures.  Pennsylvania expends more in
slightly over 1/2 of that state's classrooms than the 50
percent of the nation as a whole.  One rather interesting
thing to note is that some percentage (2<x<25) of New York's
classrooms expend more than 98% of the national number of
classrooms.

The distance from the 25th percentile to the 75th
percentile is greater for the national figure than for any
of the state figures.  New York is very different from the
Nation.  New Jersey and Connecticut are too, with Pennsylvania
most similar.
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Some  Principles  of  Graphics  for  Tables  and  Charts 

This brief handout discusses some ideas on the effective use of
graphics in technical papers and presentations.  Some of these
principles are due to Edward Tufte, whose lecture on 23 April 1976, given
to the Statistics Department at Harvard University, is the basis for
this discussion.

We will discuss the 7 principles:

1)  Less is more
2)  The 3 purposes of graphics for communications
3)  Small multiples are useful
4)  Think about page arrangement
5)  Integrate text and graphics
6)  Three-dimensional graphics are special
7)  Graphics should have "rough drafts"

These principles will be introduced by means of various examples of
graphics taken from many sources, including The Wall Street Journal,
the New Yorker and Scientific American.  Preferences are partly
subjective: what we think constitutes a good graphic may not agree
with your conception of a good display.  After all, graphics are
visual and works of art; there is a subjective aspect to their
appreciation.  However, we believe that these principles are sound
and can turn bad displays into good ones, if they are followed.

Principle 1:  Less is More


Never try to crowd too much information into a display.  Two
or three graphics are much easier on the eye than one overloaded graphic.  If
you feel that the display under development contains too much
information and might overload your readers' circuits, make two or three
displays from the original.  Or, if you strive for simplicity, merely
take the most important features from the original display and discard
the remainder.  Remember, graphics must be interpretable by the average
fellow.  A reader should not spend the majority of his/her time trying
to decipher the tables and charts contained within your paper.




Figure 1 is a histogram, where the bars are broken into various
components of personal expenditures, by percentages.  There is just
the right amount of information in this display.  Any additional bars,
or additional categories of expenditures, would make this display
uninterpretable.  In contrast, we present Figure 2, a bar chart, with
the same construction as the histogram in Figure 1.  The wild plaids
of lines in the bars of the display make it difficult to read.  There
are too many cities included here.  Can you find additional
disagreeable features?

Figure 3 is an example of a bar chart, a display similar to a
histogram, but with a horizontal axis referring to various
characteristics about the data set.  The axis does not have a scale as
with a histogram.  This bar chart is difficult to examine because
of the curved bars, although it may be pleasing to the advertising
firm that constructed it.  The moral of the figure is:  Do not
try to make your display too ornate if this excessiveness detracts
from its comprehension.

We have included 2 other displays from The Wall Street Journal
that are quite good.  Figures 4 and 5 are both bar charts that are
pleasing because of their simplicity, and their effectiveness in
conveying their message.  Notice, however, that the border around
Figure 5 is unnecessary; the arrow is catchy, but the numbers should
speak for themselves.
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Figure  3 


helps  iQ  iccdunt  fer  13  years  of 
feoonllH^is 


Beechamis  an  incemBClonai  company 
Not  only  have  its  t radihg profits  becri 
•J^rcasing  rontlnubu^^ 
i$_impprtanL  the  number  of  oountncs  in 
which  these  profits  arc  earned  has  been 
increasing  at  the  same  time  Last  year  the 
li'S^^^  ^^^rc  earned  jn  a^^  one  count  r\  - 
wh ich h appcried to bcthelJ  K  - u as 
onI>  I99pcrcentofthetotal. 


So  what  arc  the  highlightsof  l975/7e'' 

»  Worid-widesale?  li  Ol  I  4mijlion  Upb\ 
$232  4million  or 29:8 percent. on  IQr4/;5 ' 

♦  Trading  profit  $l9b  2  million  Upby  _ 
$53  9miIhon.or44  1  per  cent,  on  1974/75 

*_  Pre-tax  profit  1 1  tl  B  rnill  lOniUp  b\ 
$52  3mi!lion:or47  3  percent  on  1974/75 
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Principle  2:     Graphics  are  for  commanication 

Graphic  displays,  as  a  substitute  for  oral  or  textual  communica- 
tibn  have  3  purposes:     exploration,  reconstraction,  and  decoration. 
Displays  should  be  truthful  and  not  misleading—see  Hov  to  Lie  With 
Statistics  by  Darrell  Huff  for  some  very  V*dishonest"  graphics.  Graphic 
tools  should  attempt  to  reconstruct  reality  and  allow  the  reader  to 
explore  more  fully  the  underlying  situation  in  addition  to  decorating 
otherwise  "dull**  presentations. 

Figure  6  is  an  example  of  a  blot  map ,  occasionally  a  very 
deceptive  graphic  device.     In  a  biot_map,  we  darken  all  counties  or 
states  that  possess  a  certain  characteristic.     The  blot  map  repro- 
duced here  was  taken  from  The  New_ York  Times,  and  presents  all  the 
counties  with  15  percent  or  more  positive  net  migration  of  persons 
60  years  and  older  between  1966  and_1970.     These  206  counties  are 
supposedly  the  fastest  growing  retirement  communities.     The  encircled 
counties  in  California,  Arizona,  Nevada ,  Utah,  and  Wyoming,  listed  at 
the  bqttbih,  include  40%  of  the  shaded-in  area  on  the  map:  however, 
only  0.14%  of  the  people  over  60  years  live  in  these  counties !  The 
title  of  the  article  "More  Elderly  are  Retiring  in  the  North",  is 
not  at  all  verified  by  this  map.     One  draws  the  incorrect  conclusion 
that  the  Southwest  U.S.  is  more  popular  than  the  remainder  of  the 
country,  with  the  possible  exception  of  the  retirement  haven ,  Florida. 
The  moral  is:     Blot  maps  based  on  counties  are  misleading  because  of 
the  large  number  of  empty  counties. 

We also include a very good display, Figure 7, taken from
Scientific American, which very effectively communicates information
about nuclear devices.

Principle  3:     Small  Multiples  are  Useful 

Graphic displays can be quite small. Many small displays,
arranged on a page, can be quite effective in communicating your
message. Figure 8 shows a 12 x 5 array of histograms presented in a
good manner.

In a humorous vein, Figure 9 is an example of multivariate
"faces," developed by Herman Chernoff. These small figures are
used to differentiate observations from a larger population when
more than one measurement on each observation is available. In a
faces display each physical feature of a face is controlled by the
value of a measurement. This is quite different from a display
which puts faces on figures simply to portray the author's feelings
about displayed values (see the light bulb example in the section
"Principle 5").


Figure 6

"More Elderly Are Retiring in the North," The New York Times,
February 1976. [Blot map not reproduced.] Encircled counties:

California: Amador, Calaveras, Mariposa, Mono, Tuolumne
Arizona: Mohave, Yavapai, Yuma, Graham
Utah: Washington
Wyoming: Campbell
Nevada: Churchill

Figure 7

[Chart not reproduced. Legend: U.S. and U.S.S.R.; all fission bombs;
superbombs; other mixed fission-fusion bombs.]

Source: Herbert F. York, "The Debate Over the Hydrogen Bomb," Scientific American,
233 (October, 1975), p. 110.
Principle 4: Think about Page Arrangement

When preparing graphics for publication, or written presentation,
it is worth spending several minutes considering the arrangement of
your displays on the printed page. The following display nicely
summarizes the fourth principle of good graphics.


(Figure  10  is  an  example  of  a  cute  blunder  from  The  New  York  Times.) 


Figure 10


Correction

In last week's Review, a drawing
of the finback whale appeared this
way:

[drawing not reproduced]

The drawing should have appeared
this way:

[drawing not reproduced]

The Review regrets the error.
Whales, however, do spend just
about as much of their time swimming
on their backs as they do right
side up.


New York Times, July 6, 1975.
Principle 5: Integrate Text and Graphics

Modern production of printed material has forced written text
and graphics to become divorced from each other. The placement of
all figures and tables at the end of the paper with wonderful
announcements such as


Table 983 goes about here.


does not aid the confused reader. Having to leaf through an entire
paper to find a figure essential to the development of a hypothesis
can be quite detrimental. Why do figures and tables have to be
alienated from the text? This example from The New Yorker shows
good integration.


Profits

Over the past fifteen years our
profits have increased by a modest
million dollars. The diagram below
depicts, from left to right, three light
bulbs of steadily increasing size.
Principle 6: Three-Dimensional Graphics

Graphics are in essence two-dimensional beasts since they must
be reproducible on the printed page. However, as Figures 11 and 12
verify, graphics can be drawn as two-dimensional approximations to
three-dimensional figures. The thing to remember, and we can label
this Principle 6a, is: Professional artists can help by making good
drawings, especially figures that are not easily drawn by hand.

But be careful! Figure 13 is a poor example. In this figure,
ordinary histograms have uselessly been made three-dimensional.
This is also a poor example of a histogram. One must ask whether
it really is a histogram.


Principle  7:     Rough  Drafts 

This principle is simply stated: Produce as many drafts of
your graphic displays as you do of your text. Throughout this
discussion we have equated graphics with the written word;
consequently, it is to your advantage to polish your figures and tables
as you polish your text.
Figure 12

[Figure not reproduced.] 1956 presidential vote by party identification
(1958) and by religious identification (1960).

Source: Philip E. Converse, "Religion and Politics: The 1960 Election," in
Angus Campbell, Philip E. Converse, Warren E. Miller, and Donald E. Stokes,
Elections and the Political Order (New York: Wiley, 1966), pp. 102-103.

Source: Office of Education, HEW, American Education, October, 1975.
Module II

Introduction to Module II

Overview

Module II of the Quantitative Methods for Public Management
package provides students with experience in handling complicated
data sets describing policy relevant issues and, thus, promotes
the development of analytically oriented managerial skills. Two
kinds of skills are emphasized: performance and criticism. The
module contains two units, numbers 3 and 4. Unit 3, y versus one
X, introduces the student to modeling a relationship between two
variables: a carrier variable, X, and a response variable, y.

The general strategy is to use a linear model of the relationship
and explore the utility of various transformations on X or y or
both in improving the fit of a linear model. Fitting, modeling,
finding equations for data, and evaluating a fit are all specific
technical skills taught in this unit. Some simple procedures
are introduced for determining a good transformation and for fitting
a line to transformed data. All procedures can be performed without
the aid of a computer.

Unit 4, the second unit in Module II, introduces the student
to modeling relationships between one response variable, y, and
multiple carrier variables, X_i. Transformations to improve the
reasonableness of a linear model are again stressed. In this unit
the fitting technique is least squares regression, and the student
receives an extensive exposure to the mathematical principles
of the least squares fitting procedure as well as numerous examples
of applications, with special emphasis on the pitfalls and dangers
of simple, mechanical application of regression analysis to
multivariate data.

Specific  Objectives 
Unit  3 

Upon successful completion of Unit 3 a student will be able
to perform graphical analyses of multiple ordered batches of
quantitative data, summarize these batches using the notion of a
conditional typical value, construct scatterplots of X,y data sets
in which X and y are quantitative variables, use a line fitted
through the conditional typicals to model an X,y data set, use
least squares to fit a line to an X,y data set, find a transformation
of X and/or y to improve the linearity of a fitted model for the
data, and analyze X,y data in which X is a variable indicating time.
The critical skills a student will obtain include the ability to
evaluate how well conditional typicals summarize batches, evaluate
the ability of a linear fit to summarize an X,y data set, evaluate
the comparative advantage of least squares versus other fitting
procedures, evaluate the need for a transformation, and evaluate
the need for smoothing of a data set in preparation for an analysis
of time series data.
Unit 4

Upon successful completion of Unit 4 a student will be able
to construct a model for continuous multivariate data using the
least squares procedure, find transformations that improve a least
squares fit, interpret coefficient values in a regression model,
use indicator variables and splines in model construction, perform
inference on coefficients, perform regression analyses on a
computer and by hand, and evaluate a fitted model.

In this unit the critical skills the student will learn center
around comprehension of the effectiveness of the least squares
procedure as a fitting technique and the problems that arise when
nonlinearity is present, when overfitting occurs, when residuals
are not normally distributed, and when carrier variables are collinear.
Students will be able to evaluate the appropriateness of using the
least squares fitting procedure for specific data sets and be able
to determine whether results are due to relationships in the data
or to the peculiarities of the fitting procedure. Since this
technique is one of the most common analytic procedures appearing
in quantitative policy studies, the "doing" and "criticizing"
skills learned in Module II will be very important to the practitioner.
Unit 3
Reading Assignments


Prerequisite Inventory

Tukey, Chapter 5

Tufte, DAPP, Chapter 1

"Graphics for Scatterplots"

McNeil, Chapter 3
Tufte, DAPP, pp. 65-108
Tukey, Chapter 6


McNeil, Chapter 6
Tukey, Chapter 7


In addition, read the following articles in Tanur et al.:

pp. 120-129
    153-161
    195-202
    354-361

and the following articles in Tufte, QASP:

pp. 37-67
    113-125


Texts:

McNeil, Donald R., Interactive Data Analysis. New York: John Wiley
& Sons, 1977.

Tanur, Judith, et al., editors, Statistics: A Guide to the Unknown.
San Francisco: Holden-Day, 1972.

Tufte, Edward R., Data Analysis for Politics and Policy. Englewood
Cliffs, N.J.: Prentice-Hall, Inc., 1974.

Tufte, Edward R., editor, The Quantitative Analysis of Social Problems.
Reading, Massachusetts: Addison-Wesley Publishing Co., 1970.

Tukey, John W., Exploratory Data Analysis. Reading, Massachusetts:
Addison-Wesley, 1977.
Module II


Prerequisite Inventory
Unit 3

Unit 3 of Module II focuses on the analysis of ordered multiple
batches and paired batches of data, i.e., data in which every element
of one data vector is associated with an element in another data
vector. As in the prior two units, the skills to be learned in
this unit presuppose mastery of several elementary concepts and
procedures. Before proceeding to Unit 3, you should assure yourself
that you are familiar with these basics.

This inventory is divided into the following four sections:

1. Review of Units 1 and 2

2. Functions -- Paired Observations, Notation, Plotting

3. Special Types of Functions -- Linear, Absolute Value,
   Exponential, Inverse, Logarithm, and Polynomial

4. Properties of Functions -- Minimization

Additional references to these topics appear in the Appendix.
Homework problems have been assigned which require use of these
concepts. If you discover that you are weak in areas which will
not be covered in class, you should consult appropriate course
personnel to arrange for tutorial assistance and/or a reading
guide to background material.
Section 1. Review of Units 1 and 2

Data acquired by data analysts are usually organized in
arbitrary fashion. While arbitrarily organized data may make
retrieval of specific values easy (e.g., an alphabetical
organization of test grades for students in a class), such an
organization obscures the behavior of the batch of values and makes
continued analysis of the batch difficult. The stem-and-leaf display
is one tool the data analyst may use to organize data analytically.
This type of display possesses features of a numerically ordered
sort of the values and of a histogram simultaneously. While it permits
retrieval of individual values, it also provides a picture of the
shape of the batch and permits one to obtain the order statistics
by counting in. In constructing a stem-and-leaf display one
first notes the extreme values of the batch and makes a choice
of unit for the leaves. The leaves are placed to the right of a
vertical line which breaks the original values in the batch into
stems, which are multiples of the unit, and leaves. Thus, the
numbers -30 through 30 would appear as below in one possible
stem-and-leaf display.


    -3 | 0
    -2 | 9876543210
    -1 | 9876543210
    -0 | 987654321
     0 | 0123456789
     1 | 0123456789
     2 | 0123456789
     3 | 0
(Notice that the location of zero on the number line has been
split into -0 and +0.) It does not matter whether smaller
values appear towards the top of the display or towards the bottom;
the choice is up to the data analyst. Stretched versions of the
display are possible by using two lines per stem, with leaves with
values from 0 to 4 on one line and 5 to 9 on another (using * and .
as reminders), or five lines per stem (using *, t, f, s, . as
reminders). It is sometimes necessary to change the unit in the
middle of a display. By using asterisks as place holders and placing
leaves on the set of stems with the correct number of asterisks,
such compound stem-and-leaf displays can be created. The integers
from 80 to 200 illustrate this implicit increase in unit.

     8* | 0123456789
     9* | 0123456789

    1** | 0123456789
    2** | 0

(Note the inclusion of a blank row to separate the parts of the
display which are based on different units.) Also note that the
change in unit means that not all the integers from 100 to 200
are displayed. Rather, the shift in unit leads to representing only
the integers 100, 110, 120, . . . , 190, 200. Remember that when making
stem-and-leaf displays free hand, care must be taken to line up
leaves under one another; otherwise the display's ability to give
an accurate impression of the shape of the batch may be compromised.

If the stem-and-leaf display still seems a bit confusing,
Chapter 1 of McNeil or Chapter 1 of Tukey should be reviewed.
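
The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stem_and_leaf` and its `unit` parameter are assumptions made for this example, not part of the course materials, and the sketch handles only the simple one-line-per-stem display.

```python
from collections import defaultdict

def stem_and_leaf(values, unit=1):
    """Return the lines of a simple one-line-per-stem display.

    Each value is rounded to the leaf unit; the last digit becomes the
    leaf and the remaining digits the stem. Negative values get their
    own stems, so zero is split into "-0" and "0" as in the text.
    """
    rows = defaultdict(list)
    for v in values:
        scaled = int(round(v / unit))
        negative = scaled < 0
        stem, leaf = divmod(abs(scaled), 10)
        rows[(negative, stem)].append(leaf)

    # Negative stems come first (largest magnitude at the top).
    def order(key):
        negative, stem = key
        return -stem - 1 if negative else stem

    lines = []
    for key in sorted(rows, key=order):
        negative, stem = key
        # On negative stems the leaves run 9..0 so values keep increasing.
        leaves = sorted(rows[key], reverse=negative)
        label = ("-" if negative else "") + str(stem)
        lines.append(f"{label:>3} | " + "".join(str(d) for d in leaves))
    return lines

for line in stem_and_leaf(range(-30, 31)):
    print(line)
```

Run on the integers -30 through 30, the sketch prints eight stems, with zero appearing only on the positive "0" stem.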

While the stem-and-leaf display is a handy and effective
data organizing tool, for some purposes it may retain too much
information. Putting the information that is in a batch into
numeric and graphic summaries is called condensation. While
some information is lost in this condensing process, it reduces
the number of distracting factors to a small, easily appreciated
set of values or aspects of pictures. These summaries are
usually more easily manipulated and contrasted than are
stem-and-leaf displays. But remember, they are not as informative as a
stem-and-leaf display. It is usually wise to examine a
stem-and-leaf display of a batch before condensing it.

The five number summary contains the median, hinges, and
extremes. The median is the value obtained by counting halfway in
through the sorted batch. It is located at the "depth" (N+1)/2, where
N is the total number of values in the batch. If N is even,
then the median is the mean of the two middle values. The hinges
are located at depth (depth of the median + 1)/2. The extremes are
simply the largest and smallest values in the batch. They are
located at either end of the batch at depth 1. Tukey suggests
the following letter value display for this summary.
        DEPTHS
    M               Median
    H           Hinge     Hinge
    E          Extreme   Extreme

An extended version of this display, a fenced letter display, gives
the value of the midspread (the difference between the hinges), a step
(1.5 times the midspread), the inner fences (1 step beyond each
hinge), the outer fences (2 steps beyond each hinge), and the adjacent
values (the last values before each inner fence).


[Display not reproduced: the letter value display extended with
entries for the inner fences, outer fences, and adjacent values.]


(Note that for some batches adjacent values and hinges may be equal.)
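
As a rough illustration of the fence arithmetic, the sketch below computes the midspread, step, fences, and adjacent values from a batch and its two hinges. The function name, the returned keys, and the sample batch are all assumptions made for this example; the labels "outside" (between the inner and outer fences) and "far out" (beyond the outer fences) follow Tukey's usage.

```python
def fenced_summary(batch, lower_hinge, upper_hinge):
    """Midspread, step, fences, adjacent values, and outside values."""
    midspread = upper_hinge - lower_hinge
    step = 1.5 * midspread
    inner = (lower_hinge - step, upper_hinge + step)
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)

    # Adjacent values: the most extreme values still inside the inner fences.
    within = [v for v in batch if inner[0] <= v <= inner[1]]
    adjacent = (min(within), max(within))

    outside = sorted(v for v in batch
                     if not (inner[0] <= v <= inner[1])
                     and outer[0] <= v <= outer[1])
    far_out = sorted(v for v in batch if v < outer[0] or v > outer[1])
    return {"midspread": midspread, "step": step, "inner": inner,
            "outer": outer, "adjacent": adjacent,
            "outside": outside, "far_out": far_out}

# Hypothetical batch of 11 values; its hinges are 3.5 and 8.5.
batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
result = fenced_summary(batch, lower_hinge=3.5, upper_hinge=8.5)
print(result)
```

Here the midspread is 5 and the step 7.5, so the inner fences fall at -4 and 16 and the outer fences at -11.5 and 23.5; the value 20 is outside and 40 is far out.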


The schematic plot is a graphical summary. It represents
the values from a fenced letter display as a picture.


[Figure not reproduced: schematic plot marking the median, hinges,
adjacent values, outside values, and far outside values, with the
positions of the inner and outer fences indicated.]

Note that the inner fences and outer fences are not actually
drawn in the display.

The schematic plot can be drawn vertically or horizontally at
the discretion of the analyst.

The utility of this picture rests in its schematic quality;
it is a structural outline of the batch. Of course, there will
be batches whose structures are not well conveyed by this form
of display. Batches with separations between values may fit in
this class. Consequently, graphical summaries as well as numerical
summaries should be relied upon only after the entire batch has
been examined in a stem-and-leaf display.

It is easier to think about and summarize batches which are
symmetric than those which are not. In a symmetric batch the median
will be in a position around which the batch could be folded with
one half of the batch reflected by the other half. Consequently,
the hinges and extremes will be equally spaced from the median.
When the batch is symmetrical, the mid-hinge (the mean of the
hinges, MidH) and the mid-extreme (the mean of the extremes,
MidE) will have the same value as the median. This fact can be
used to test for symmetry. (Note that in some batches which we
will call symmetric these values will be only approximately equal.)

Some batches, while not symmetric in the original unit,
become symmetric after a simple power transformation of the form
X^r, where r is a simple power, a rung on the ladder of powers, and
where r = 0 implies logarithms. To find the transformation that
best symmetrizes the batch we need only investigate the midsummary
values derived from the batch's five number summary. When
M < MidH < MidE we go down the ladder of powers; when
M > MidH > MidE we go up. It should be noted that there may not
be a convenient r which symmetrizes the batch in question. In
that case, the raw values must suffice. The midsummary array of
a transformed batch is called a transformation summary.

The usual values of r are 1/2 for square roots, 2 for
squares, -1 for reciprocals (negated to preserve order), and 0
for logs. Sometimes we can choose a transformation easily
by using the following rules for types of data. For amounts and
large counts use logs, for percentages use the arc sine of the
square root, and for balances transform before subtracting to obtain
the balance.
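
The midsummary check can be sketched as below. The function names and the sample batch are assumptions made for illustration. The sketch transforms the summary values directly, which works here because a power transformation preserves order and the summary depths for this batch are whole numbers.

```python
import math

def power_transform(x, r):
    """One rung of the ladder of powers; r = 0 means logarithms."""
    if r == 0:
        return math.log10(x)
    if r < 0:
        return -(x ** r)  # negated so the order of values is preserved
    return x ** r

def midsummaries(median, lower_hinge, upper_hinge, low_extreme, high_extreme):
    mid_h = (lower_hinge + upper_hinge) / 2
    mid_e = (low_extreme + high_extreme) / 2
    return median, mid_h, mid_e

def ladder_direction(median, mid_h, mid_e):
    if median < mid_h < mid_e:
        return "down"
    if median > mid_h > mid_e:
        return "up"
    return "no clear direction"

# Five number summary of the hypothetical batch 1, 2, 4, ..., 256:
# median 16, hinges 4 and 64, extremes 1 and 256.
raw = (16, 4, 64, 1, 256)
print(ladder_direction(*midsummaries(*raw)))

logged = tuple(power_transform(v, 0) for v in raw)
print(midsummaries(*logged))
```

The raw midsummaries (16, 34, 128.5) climb steadily, so the rule says to go down the ladder; after taking logs the three midsummaries agree to within rounding error.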
A special kind of symmetric batch that has no outliers and
closely approximates a theoretical Gaussian or Normal curve is called
a well behaved batch. It is mathematically convenient to summarize
such batches with the mean and standard deviation. The mean, $\bar{X}$, is
equal to:

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

and the standard deviation, s, is the square root of the variance
and is equal to

$$s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$$

Another way of thinking about the standard deviation is to view it as
roughly the square root of the average squared deviation about the
mean. An important property of a well behaved batch is that
s ≈ 3/4 midspread. (Remember the symbol ≈ means "approximately equal
to".) We can transform any well behaved batch into a standard well
behaved batch by subtracting the
mean of the batch from each value and dividing each difference by the
batch standard deviation. The resulting batch of standardized values
has mean = 0 and variance = standard deviation = 1.

The importance of this standardization process is that a great
deal is known about the properties of standard well behaved batches.
In particular, we know what percent of the batch lies between various
values. For example, between -1.96 and +1.96 lie 95% of the values.
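
A minimal sketch of the standardization step, using Python's standard `statistics` module. The function name and the sample batch are hypothetical, and the sketch assumes the sample standard deviation with divisor N - 1 (what `statistics.stdev` computes); a divisor of N would give slightly different values.

```python
import statistics

def standardize(batch):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = statistics.mean(batch)
    s = statistics.stdev(batch)  # assumption: divisor N - 1
    return [(x - mean) / s for x in batch]

batch = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean 5
z = standardize(batch)
print(statistics.mean(z), statistics.stdev(z))
```

Whatever the original units, the standardized batch has mean 0 and standard deviation 1, up to floating point rounding.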
Well behaved batches play particularly important roles in statistical
inference for regression by least squares, a topic to be covered
later in this course. Any well behaved batch is completely summarized
when its mean and standard deviation are known. For ill behaved
batches, which are by far the more common variety, the mean and
standard deviation are very rarely sufficient summaries.

When we have only one batch of values there is little
additional analysis that can be performed. Once we have answered
questions concerning typical value, spread, shape, and separations
and have searched for a symmetrizing transformation, we have obtained
just about as much information as we can. However, when the data
come in the form of multiple batches, we can expand our inquiry
by contrasting the batches with one another. Unordered multiple
batches are a set of batches (2 or more) which have no quantitative
relation between them, i.e., the batches cannot be located on a
scale. Ordered batches, to be considered in Unit 3, can be located
on such a scale. By contrasting batches we mean that we can compare
typical values, shapes, etc. -- all of the features which, for single
batches, we simply noted.

To perform contrasts on unordered multiple batches we can
use the same tools employed earlier but in parallel fashion. That
is, we can draw stem-and-leaf displays and schematic plots side
by side. We must be careful here to have the plots on the same
scales and to use the same units. We can also use side by side
number summaries for contrasts.
Often, a consistent relationship between typical value and
spread may interfere with our ability to make an effective
comparison. This occurs when there is an obvious trend in which
typical value increases with spread or vice versa. When this
happens it is very difficult to determine how much of the apparent
difference in typical values is due to differences in spread.
An appropriately chosen transformation can often effectively
equalize spread and permit us to perform contrasts with the confounding
influence of changes in spread eliminated. We can usually find
this transformation by first calculating the median and midspread
for each batch and then making a scatterplot of log(midspread) against
log(median). If a clear line seems to fit these points we
crudely estimate the slope of this line, m, and transform all the
batches by taking X^r, where r = 1 - m, following the rules for the
ladder of powers. One may also view this procedure as a way of
obtaining the appropriate unit for all the data. (Sometimes it
may be necessary to examine log(difference in extremes) against
log(median).)
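
The slope-to-power rule can be sketched as follows. Two assumptions are made here: log(midspread) is placed on the vertical axis and log(median) on the horizontal axis (the usual spread-versus-level arrangement, for which r = 1 - m holds), and an ordinary least squares slope stands in for the "crude" estimate by eye. The function name and the sample figures are hypothetical.

```python
import math

def spread_level_power(medians, midspreads):
    """Slope m of log(midspread) on log(median), and the power r = 1 - m."""
    xs = [math.log10(v) for v in medians]
    ys = [math.log10(v) for v in midspreads]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least squares slope, used here in place of an eyeball estimate.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, 1 - slope

# Three hypothetical batches whose spread grows in proportion to level.
m, r = spread_level_power(medians=[10, 100, 1000], midspreads=[2, 20, 200])
print(m, r)
```

When spread is proportional to level the slope is about 1, so r is about 0 and the ladder of powers points to logarithms.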

Section  2.  Functions 

One of the most fundamental concepts in modern mathematics
is that of functions. A function is an operation involving two
sets of numbers: the input values, which are usually denoted by x,
and the output values, usually denoted by y or f(x). (Note this
functional notation. We read the notation f(x) as "f of x" or "a
function of x". Other letters in upper or lower case may be used
instead of f. Do not confuse the use of parentheses here with their
usual usage in an equation where they imply a multiplicative
operation. Brackets, [ ], and other separators may be used for
the same function notation purposes.) To each input value, a
function assigns exactly one output value. The set of all input
values for a function is called its domain and the set of all
output values is its range.

The mathematical operations that we use most often, such
as square root, square, and logarithm, are all functions. When
we use a variable to represent the domain of a function, we call
it an independent variable. The variable representing the range
is called the dependent variable; it is functionally dependent on
the independent variable. Functions are usually indicated by
letters preceding the independent variable, which is enclosed in
parentheses. Thus, f(x) = log x is the logarithmic function,
f(x) = x^(1/2) is the square root function, etc. A useful way of
thinking about functions is to view them as rules of correspondence.
A graphical representation of this assignment process, in which values
in the domain are assigned to values in the range, is shown below.


Domain  Range 
We deal with functions whose domain and range consist of
real numbers, i.e., functions of real variables. A subset of
these functions are functions which have as their domains only the
integers. These are called functions of discrete variables.

We can obtain geometric representations of functions of a
variable by graphing or plotting the function on a rectangular or
Cartesian coordinate system. In this system two real lines
(coordinate axes) are drawn at right angles on a plane so that
they share a common origin. The convention is to label the
vertical line as the y-axis and the horizontal line as the x-axis.

[Figure not reproduced: the x-axis and y-axis crossing at the origin.]

When a rectangular coordinate system is drawn on a plane the plane
is called a rectangular coordinate plane or xy-plane. Points may
be plotted on this plane in the following way. An ordered pair of
values, one for the x variable and one for the y, in that order
(by convention), is determined. These are usually given as a
pair of values enclosed in parentheses. To graph the point we
move along the x-axis to the location on the x-scale that equals
the first member of the ordered pair of values. We then move
vertically above or below this point parallel to the y-axis until
we reach an imaginary horizontal line intersecting the y-axis at
a point on the y-scale equal to the second of the ordered pair
of values. An illustration appears below for the points (-4,-3),
(4,-3), (-4,3), and (4,3).


[Figure not reproduced: the points (-4,3), (4,3), (-4,-3), and (4,-3)
plotted on the xy-plane, with dashed lines running from each point
to the axes.]

Obviously, the procedure can be performed in reverse order by first
finding the y-axis location and then the x. The dashed lines in the
illustration are provided for clarity and are not drawn in practice.
To distinguish one point from another we use subscripts. Each
ordered pair receives a subscript value specifying its sequence in
the set of ordered pairs. If we think of the four plotted points as
having come to us in the sequence (-4,3), (4,3), (4,-3), (-4,-3),
then we can indicate these four points as follows:

(x_1,y_1) = (-4,3)
(x_2,y_2) = (4,3)
(x_3,y_3) = (4,-3)
(x_4,y_4) = (-4,-3)

In general, points are denoted (x_i, y_i) where i runs from 1 for the
first point to the last point. When (x_i, y_i) appears alone it
means "some arbitrary point".

Formally, we say that the rectangular coordinates of a point are
given by the ordered pair (x,y) and we use the terms point and ordered
pair interchangeably. The terms for the x and y values in the ordered
pair are abscissa and ordinate, respectively. We also name the
quadrants into which the rectangular coordinate system divides the
plane as follows.


Quadrants of the Cartesian Plane

II   |  I
-----+-----
III  |  IV

We can graph functions by recording in ordered fashion values of
the domain and corresponding values of the range. If the function is
defined on a continuous variable then the number of possible points
that can be graphed will be infinite regardless of whether the
function is bounded. We represent the graph of a continuous function
as a smooth curve and a discrete function as a set of distinct
points.

The procedure involves constructing a table for x and y values.
One chooses x values in the domain of the function and solves the
equation for the corresponding individual values of y. Usually, a
few ordered pairs of values are sufficient to allow us to make a
geometric picture of all the ordered pairs that represent solutions
to the equation. Obviously, if the domain and range extend over
all the real numbers, we can't graph the function all the way to
infinity. By convention, we place arrowheads on the ends of graphed
curves to indicate that the function continues similarly beyond
the last plotted point (although sometimes the arrowheads are left
out). Examples of these procedures appear below and in the next
section.
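The table-building step described above can be sketched in a few lines of Python. This is only an illustration: the function `line` (y = 2x + 1) and the chosen x values are assumptions made for the example, not part of the course material.

```python
def table_of_values(f, xs):
    """Return the ordered pairs (x, f(x)) for each chosen x in the domain."""
    return [(x, f(x)) for x in xs]


def line(x):
    # Hypothetical example function, y = 2x + 1
    return 2 * x + 1


pairs = table_of_values(line, [-2, -1, 0, 1, 2])
for x, y in pairs:
    print(f"{x:>3} {y:>3}")
```

Each printed row is one ordered pair that would be plotted on the coordinate system.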

Note that not all equations in x and y define functions of x.
Nor are all curves that can be drawn on the xy-plane functions of
x. The critical quality of a function is the assignment of
exactly one value in the range to each value in the domain.
Compare the following graphs.

[Figure: graphs that are not functions of x (for example, the vertical
line x = b) beside graphs that are functions of x (for example, the
horizontal line y = b).]


In general, the graph of a function, f(x), contains all points
(x,f(x)), where x is in the domain of f. The procedure for plotting
equations which are not functions of x requires obtaining all the
multiple values for y which represent solutions to the equation for a
given value of x.

Section 3. Special Types of Functions
A. Linear functions

Functions of the form

$f(x) = ax + b$

where a and b are real numbers are called linear functions. Their
graphs are straight lines which intercept the y axis at the point
(0,b) and have slope = a. The slope is often denoted by an m and is
the ratio of the vertical change to the horizontal change, or the
number of unit changes in y for a unit change in x. The Greek upper
case delta, "Δ", by convention, is used to represent change. Thus,
slope is

$m = \frac{\Delta y}{\Delta x}$

and can be computed from any two points that lie on a line, $(x_1, y_1)$,
$(x_2, y_2)$, by the following definition:

$\frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$

For any pair of points satisfying a given linear function, this ratio
is constant. The slope of the vertical line x = a is not defined
(note that this equation is not a function of x).
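As a small illustration of the two-point slope definition, the following Python sketch computes the ratio exactly with fractions and returns `None` for a vertical line, where the slope is not defined. The function name `slope` and the sample points are assumptions made for the example.

```python
from fractions import Fraction


def slope(p1, p2):
    """Slope of the line through two points, or None for a vertical line."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        return None  # vertical line: the slope is not defined
    return Fraction(y2 - y1, x2 - x1)


# Any two points on the (hypothetical) line y = 2x + 1 give the same ratio.
print(slope((0, 1), (1, 3)))
print(slope((-2, -3), (5, 11)))
print(slope((4, 0), (4, 7)))
```

The first two calls use different pairs of points on the same line and return the same value, which is the constancy property stated above.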
Some important facts about lines follow.

a. Two lines with slopes $m_1 = m_2$ are parallel.

b. A horizontal line has slope m = 0.

c. If m > 0 the line rises from left to right, i.e., it slopes
upward.

d. If m < 0 the line falls from left to right, i.e., it slopes
downward.

e. The form y = mx + b is called the slope-intercept form of the
equation for a line.

f. The form $y - y_1 = m(x - x_1)$ is the point-slope form of the equation
for a line.

g. The form ax + by + c = 0 is the general linear form of the
equation for a line.

B. Absolute Value

The function with the form

f(x) = |x|

is called the absolute value function. The two vertical bars
surrounding the x on the right side of the equation are a notational
convention indicating that the magnitude of x, which is never
negative, is to be returned. Thus, the domain of the absolute value
function is all the real numbers while its range is the nonnegative
reals. Its graph appears below.

[Figure: The Absolute Value Function, y = |x|.]

C. Exponential
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The  function  with  the  form 

$f(x) = b^x$

where b > 0, b ≠ 1, with all the real numbers in its domain is called the
exponential function. Its range is all the positive numbers.
The exponential function has a graph with the shape:

[Figure: a curve rising from left to right.]

when b > 1 and a shape

[Figure: a curve falling from left to right.]

when b is between 0 and 1. Regardless of the actual value of b, the
graph has one of these two basic shapes. For b > 1, as b increases
the curve becomes more steep. For b < 1, as b decreases the curve
becomes more steep.
The constant, b, is called the base of the exponential. Two
common bases are the irrational number e ≈ 2.71828 and the number
10. Note that graphs of all exponential functions pass through the
point (0,1). Furthermore, although the curve approaches the x axis,
it never actually intersects it.
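The properties just listed can be checked numerically. The sketch below is illustrative only; the bases 2 and 0.5 and the sample x values are assumptions chosen for the example.

```python
def exponential(b, x):
    """Evaluate b**x for a valid exponential base (b > 0 and b != 1)."""
    if b <= 0 or b == 1:
        raise ValueError("base must be positive and different from 1")
    return b ** x


xs = [-2, -1, 0, 1, 2]
growth = [exponential(2, x) for x in xs]    # b > 1: values rise with x
decay = [exponential(0.5, x) for x in xs]   # 0 < b < 1: values fall with x

print(growth)
print(decay)
```

Both lists contain only positive values, and both take the value 1 at x = 0, matching the point (0,1).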

D.  Inverse 

The inverse of a function is that function which when applied to
a function of x returns the original value of x. The inverse, by
convention, is denoted by a superscript -1 placed at the upper right
hand side of the function name just before the left parenthesis, e.g., the
inverse of f(x) is $f^{-1}(x)$ and $f^{-1}[f(x)] = x$.

If we think of a function as a rule of correspondence which
assigns a value in the function's range to every value in the domain,
then the inverse function takes as its domain the function's range and
assigns to these values the corresponding values in the function's
domain. For example, if f(x) is a function with a domain value of 5
assigned to a range value of 25 then $f^{-1}(x)$ has a domain value of 25
to which 5 is the assigned value in its range. A graphical picture of
this process appears below.


Module  11 


The inverse of the linear function f(x) = ax is $f^{-1}(x) = (1/a)x$
since, by substitution, $f^{-1}(f(x)) = f^{-1}(ax) = (1/a)(ax) = x$. The
function f(x) = x is its own inverse. In general, if f(x) = ax + b then
$f^{-1}(x) = (1/a)(x - b)$.
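A brief Python sketch of the linear inverse follows. The helper names `make_linear` and `make_linear_inverse` are hypothetical, and the coefficients are chosen only to reproduce the 5-to-25 example in the text.

```python
def make_linear(a, b):
    """Return the function f(x) = a*x + b."""
    return lambda x: a * x + b


def make_linear_inverse(a, b):
    """Return the inverse (1/a)(y - b); it exists only when a != 0."""
    if a == 0:
        raise ValueError("a constant function has no inverse")
    return lambda y: (y - b) / a


f = make_linear(5, 0)            # assigns 5 to 25
f_inv = make_linear_inverse(5, 0)
print(f(5), f_inv(25))

g = make_linear(2, 3)
g_inv = make_linear_inverse(2, 3)
print(g_inv(g(4)))               # applying the inverse returns the original x
```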

E.  Logarithm 

Another important inverse function is the inverse of the
exponential, the logarithmic. Recall that the logarithm of x to the
base b is defined

$y = \log_b x$

which is simply a notational way of saying that

$b^y = x$.

To see that the logarithm is the inverse of the exponential we
must determine $f^{-1}[f(x)]$ for $f(x) = b^x$. Assuming that
$f^{-1}(x) = \log_b x$ we have $f^{-1}[f(x)] = \log_b b^x = x(\log_b b) = x$. In
other words, the logarithmic function is defined as the inverse of the
exponential, i.e., it is that function which reverses the ordering of
the pairs of points that represent assigned values of the domain and
range for exponential functions.

We  can  see  this  graphically  in  the  diagram  below. 


We see from the diagram that the base, b, for the logarithm is the
same constant that is the base of the exponential. Where the point
(0,1) must be on the exponential plot, the reversed point (1,0) must
be on the logarithmic. While the exponential approaches the x-axis
(i.e., the line where y = 0) but never reaches it (except in the limit
as x goes to -∞ in the example) the logarithmic approaches the y-axis
(i.e., the line where x = 0) but never reaches it (except in the limit
as y goes to -∞ in the example).

F. Polynomial

Functions of the form

$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$

are called polynomials in x of degree n where n is the largest
exponent of x for which $a_n \neq 0$. The exponent of x is always a
nonnegative integer and the constants, $a_i$, are real numbers.
Polynomials in x can be plotted on the Cartesian plane.

The linear function is a polynomial in x of degree 1. An
important polynomial is the polynomial of degree 2, called the
quadratic function. It is usually written as

$f(x) = ax^2 + bx + c$

The domain of the quadratic function is all the real numbers and, in
general, the graph is of the following form.

[Figure: graphs of $y = ax^2 + bx + c$.]

Note that the graph of the quadratic function may be shifted to the
right or left or up or down depending on the values of the constants.
A graph of a quadratic is called a parabola. Another form is
y = g(x-h)(x-k) where g, h and k are constants.

A polynomial of degree 3 is called a cubic. Graphs of cubics
have at most two bends in them. Polynomials of degree higher than
three do not have special names. In general, the graph of a
polynomial of degree n has at most n-1 bends.
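The relationship between degree and number of bends can be explored numerically. The sketch below counts changes of direction on a grid; the function names, the grid size, and the example polynomials are all assumptions made for illustration, and a grid can miss bends that lie outside the chosen interval.

```python
def poly(coeffs, x):
    """Evaluate a polynomial by Horner's rule; coeffs run from the highest power down."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result


def count_bends(coeffs, lo, hi, steps=1000):
    """Count changes between rising and falling on an evenly spaced grid over [lo, hi]."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ys = [poly(coeffs, x) for x in xs]
    bends = 0
    previous = 0
    for y0, y1 in zip(ys, ys[1:]):
        direction = (y1 > y0) - (y1 < y0)   # +1 rising, -1 falling, 0 flat
        if direction != 0:
            if previous != 0 and direction != previous:
                bends += 1
            previous = direction
    return bends


print(count_bends([1, 0, -3, 0], -3, 3))   # cubic x^3 - 3x
print(count_bends([1, 0, 0], -2, 2))       # quadratic x^2
print(count_bends([2, 1], -2, 2))          # linear 2x + 1
```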

Section 4. Properties of Functions: Minimization

Functions of x can be evaluated over their entire domains or over
portions of their domains. Within any interval on the x axis, a
function is said to be increasing or decreasing depending on whether,
as x increases, f(x) is strictly increasing or strictly decreasing,
respectively. For some functions, such as linear functions, the
function will be either increasing or it will be decreasing over its
entire domain. This is also true of the exponential and logarithmic
functions. However, polynomials of degree two or more can have bends
in them and may be strictly increasing or strictly decreasing only
within some interval on the x-axis.

[Figure: A Parabola, with f(x) decreasing on the left branch and
f(x) increasing on the right branch.]

The parabola in the example decreases in the interval -∞ < x < 0 and is
increasing in the interval 0 < x < ∞. The absolute value function
behaves in the same way.

[Figure: Absolute Value Function, with f(x) decreasing on the left
branch and f(x) increasing on the right branch.]

Functions that are strictly increasing have minimum values of
f(x) at their left-most bound. In other words, linear functions with
positive slope have values of f(x) which are less than any other value
in their range when x is the lowest value in their domain. Linear
functions that are decreasing over their domain, i.e., returning
smaller and smaller values for f(x) as x increases, have a minimum
value in their range that corresponds to the maximum value in their
domain. If these functions have domains which are the real numbers
then their minima are not defined. Exponentials in which b is > 1
are strictly increasing and when b < 1 (but > 0) they are strictly
decreasing. The same will be true of logarithmic functions.

Linear functions that have positive slope are increasing; those
with negative slope are decreasing. To evaluate a function which is
not linear we may place lines tangent to points on the function and
examine the slopes of these tangents. If the slopes are always
positive, then the function is increasing; if the slopes are always
negative, then the function is decreasing.
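The idea of examining tangent slopes can be sketched numerically. The code below approximates each tangent slope with a central difference, which is an assumption of this illustration rather than a method named in the text; the function names and sample intervals are likewise hypothetical.

```python
def tangent_slope(f, x, h=1e-6):
    """Approximate the slope of the tangent to f at x by a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def direction_on(f, lo, hi, steps=100):
    """Classify f on (lo, hi) from the tangent slopes at evenly spaced interior points."""
    xs = [lo + (hi - lo) * (k + 0.5) / steps for k in range(steps)]
    slopes = [tangent_slope(f, x) for x in xs]
    if all(m > 0 for m in slopes):
        return "increasing"
    if all(m < 0 for m in slopes):
        return "decreasing"
    return "neither"


def parabola(x):
    return x * x


print(direction_on(parabola, -5, 0))
print(direction_on(parabola, 0, 5))
print(direction_on(parabola, -5, 5))
```

Sampling a finite number of points can only suggest the behavior on an interval; it does not prove it.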


Some non-linear functions have points where a tangent to the
curve will have slope = 0. These are called critical points. If the
directionality of a function changes within some interval in its
domain, i.e., if it goes from decreasing to increasing or increasing
to decreasing, then the slope of tangents to the curve must go from
negative to positive or positive to negative, respectively. To do so
they must at some point have slope = 0, i.e., the function must have a
critical point in this same interval. If a switch in directionality
occurs in an interval then the critical point is a relative minimum or
relative maximum depending on whether the change in directionality
goes from decreasing to increasing or from increasing to decreasing.

[Figure: a curve with negative tangent slopes, m, where f(x) is
decreasing; a critical point with m = 0, where f(x) is neither
increasing nor decreasing; and positive slopes where f(x) is
increasing.]

If there are no other relative minima or relative maxima then the
critical point identifies an absolute minimum or absolute maximum.
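One way to locate a critical point numerically is to bisect on the sign of the tangent slope. This is a sketch under stated assumptions: the slope is approximated by a central difference, it is nonzero at both ends of the interval, and it changes sign exactly once inside it. The function names and the example curves are hypothetical.

```python
def tangent_slope(f, x, h=1e-6):
    """Approximate the slope of the tangent to f at x by a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def critical_point(f, lo, hi, tol=1e-8):
    """Bisect on the tangent slope, assuming one sign change inside [lo, hi]."""
    m_lo, m_hi = tangent_slope(f, lo), tangent_slope(f, hi)
    if m_lo * m_hi >= 0:
        raise ValueError("slope must have opposite signs at the two ends")
    # Negative then positive slope: decreasing then increasing, a relative minimum.
    kind = "relative minimum" if m_lo < 0 else "relative maximum"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tangent_slope(f, mid) * m_lo > 0:
            lo = mid   # slope at mid has the same sign as at the left end
        else:
            hi = mid
    return (lo + hi) / 2, kind


def bowl(x):
    return (x - 2) ** 2 + 1      # hypothetical parabola with its vertex at x = 2


def hill(x):
    return -((x + 1) ** 2)       # hypothetical parabola with its peak at x = -1


print(critical_point(bowl, -3, 5))
print(critical_point(hill, -4, 3))
```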
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Homework 
Prerequisite Inventory, Unit 3


1. On a Cartesian coordinate system graph the function
f(x) = |x| + 10

for the values of x = -10 through +10. Label the point (0,10).

2. On a Cartesian coordinate system graph the function
f(x) = 1/x

for the values of x = -10 through +10. Label the points (-1,-1)
and (1,1).

Make  a  plot  of  the  function 

f(x)  «  1^ 
for  integer  values  of  x  from  0  to  10. 
Graph  x-4y  ,     Is  this  a  function  of  x? 

5. What are the domain and range of the following functions?

a. f(x) = |2x-10|

b.  f(x)  -  if 

X 

c.  f(x)  s  v^c^ 

Locate  and  label  the  following  points  on  a  rectangular  coordinate 
system  and  give  the  quadrants  in  which  each  point  lies. 


(2,7),  (8,-3),  (^,-2),  (0,0) 


7. The following table indicates the number of widgets that are
purchased each week for four weeks and the price widgets sold
for in each week. Plot price as a function of quantity sold.
Sketch in the curve. What is the name for this type of curve
(in economic jargon)? What is the relationship between widget
price and widget sales? Is this an increasing or decreasing
function?


week 


price/widget 
quantity /week  sold 


20  10 


10  20  25 
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8.  Find  the  slope  of  the  straight  line  which  passes  through  the 
following points.

a. (5,2), (7,5)

b. (-2,3), (3,-1)

c. (-2,4), (-2,8)

d. (5,-2), (4,-2)

9. Give in functional form the equations for linear functions that
have the indicated properties.

a. passes through (1,2) with slope 6.

b. passes through (-2,5) with slope -1/4.

c. passes through (1,4) and (8,7).

d.  passes  through  0,-1)  and  (-2,-9). 

10. For the following give the slope and y intercept of the line.

a. x = -2y + 4

b. 4x + 9y - 5 = 0

c.  .Ix  *  Zy  +  i 

a.  :i    i'  ^  ~A 

11. The forecasted population, $P_F$, of a city is given by

$P_F = P_C e^{at}$

where $P_C$ is the current population, a is a constant, e is Euler's
number and t is the number of years after 1976.

If the city's current population is 100,000 give the forecasted
population in 1996 (assume a = .05). What interpretation can you
give to a?

12.  if  6y  -         express  r  as  a  function  of  y. 

13. Solve the following equation for x.

$x + 1 = \log_4 16$

14. Simplify the expression $10^{2 \log x}$

15. Solve the equation $y = e^{(\ln 3 + 2 \ln 4)}$

16. The demand equation for a product is defined by

Express x as a function of p.

17. The following graphs are typical of what kind of function?


A B


Module  II 

Homework Solutions
Prerequisite Inventory, Unit 3



5.

a. Domain: all reals
Range: all nonnegative reals

b. Domain: all non-zero reals
Range: all positive reals

c. Domain: all reals 5
Range: all nonnegative reals



6. (0,0) does not lie in a quadrant.

7. Demand curve. As demand rises price falls. Decreasing.


8. a. 3/2

b. -4/5

c. not defined

d. 0

a.  f (x)  -  6x  -4 


9. 


b.  f(x)  -  -|  +  I 

c.  f(x)  -  3,c  + 

7  ? 

d.  f  (x)  »  |x  -  M 


id.  a. 


b.    "  a  »  o 


4  5. 
9  *  9 


c.  9/28.  -3/28 

d. 3/2, 12
11. 271,828; a is the yearly percentage increase

12.  r  -  i  log.  (6y) 

2       *  ■  •  .     .  ' 

13.  X  »  1 
14. 

15.  48 

16.  X  -  f  (p)  -  16(1  -  -i2S-l  )    5r    10(1  -  l28-£  ) 

1'0792  log  12 

17.  a.  exponential 

b .  logarithmic 

c .  parabolic 




Lecture  3-0.    Introduction  to  Unit  3 


Introduction to Unit 3, Analysis of (X,Y) data.


Lecture Content:

Introduction to the objectives, problems, and notation of Unit 3


Main topics:

1. Specific Introduction to the Objectives of Unit 3

2. Presentation of General Problem of Unit 3

3. Notation for Unit 3


Note to instructor:

Unit 3's primary pedagogic role is that of a precursor to
regression. The lectures are set up to provide students with
an intuitive grasp of fitting lines to (X,Y) data so that the
application of specific fitting algorithms does not seem like
an arbitrary operation. The essential notion remains that of
finding a model that fits the data and quantifies the effect
on Y of movement along the X dimension. Exploratory procedures
learned in earlier units are applied throughout.

Topic 1. Specific Introduction to the Objectives of Unit 3
I. Questions to be answered in Unit 3

1. What is an ordered multiple batch?

a. A collection of batches related in some quantitative (1)
way (as opposed to unordered multiple batches which
are qualitatively related)

b. The ordered relation between batches is defined on
some scale and used in the analysis

c. Examples: life expectancies for countries, classified
by per capita income of country; number of vehicles
for transit systems, classified by the population
served by system

2. What analyses can be done on an ordered collection of
batches?

a. How can we best examine the batches by using the
ordered scale of the batches

b. How can we summarize the information in the batches
and the relationship between each batch and the
value for the batch on the scale

c. How can we transform both the batches and the scale
relation

3. What is an $(X_i, Y_i)$ paired observational batch?

a. Data set consisting of two batches of equal size (2)

b. The ith observation of the first batch, called $X_i$,
is related to the ith observation of the second
batch, $Y_i$

c. We thus have a batch of paired observations, or
ordered pairs $(X_i, Y_i)$

d. Examples: IQ scores of twins; achievement pretest
score (X) and fall final exam score (Y) for each
member of this class

4. What analyses can be done on a batch of paired observations?

a. How can we best examine the scatterplot of $(X_i, Y_i)$
values

b. How can we best summarize the relationship between
the X variable and Y variable

c. How do we determine whether a transformation of either
X or Y or both would improve the summarization

5.  What  is  a  batch  of  time  series  data? 

a. (X,Y) paired data set, where X is time (months, (3)
years, decades, etc.)

b. One Y will be associated with each X, i.e., impossible
to have two or more observations at a single point in
time

c. Examples: Gross National Product of the U.S. for the
years  194<j-1976;  daily  reported  cases  of  swine  flu, 
January-September  1976 

6.  What  analyses  can  be  done  on  time  series  data? 

a. How can we smooth the data to remove irregularities

b.  How  and  when  can  we  extrapolate  beyond  the  current 
time  range,  and  interpolate  between  two  adjacent  time 
points 

c.  What  can  we  say  about  any  periodicities  within  the 
time  series 


II.     Skills  to  be  mastered  in  Unit  3  (4) 

1. Perceiving and analyzing ordered multiple batches

2. Looking at scatterplots of (X,Y) data

3. Summarizing scatterplots by fitting lines

4. Smoothing the irregularities in time series data

5. Extrapolating, interpolating, and studying the periodicities
of time series data

Topic 2. Introduction to the Problems of Unit 3
I. What is an ordered multiple batch?

1. Example: Average net interest costs, in percent, for bond (5)
sales for public schools, by Bond Moody rating, for various
years

a. Relation: Percent interest for bonds issued for
public schools

b. Quantitative aspect: Bonds classified by their
Moody Rating, Aaa-Ba

2.  The  Quantitative  ordering  is  extremely  important.  We 
can  associate  for  each  batch  in  the  collection  a  value 

on  the  ordered  scale. 

II. How can we best analyze the batches? (6)
1.    Obvious  questions: 


a. Minima

b. Maxima

c. Spreads

d. Medians

e. Shape

f. Units

2.    Subtle  questions: 

a.  What  is  a  good  typical  value  for  each  batch? 

b. Conditional on being in a specific batch, what is the
typical  value  for  the  batch?    We  call  these 
"conditional  typical  values" 

c.  How  are  these  conditional  typicals   used  to  summarize 

the  entire  batch? 

III. What is a batch of (X,Y) data?

Example: Number of vehicles and vehicle miles for (7)
transit systems serving populations over 1 million
people, in 1971

a. X variable: Number of Vehicles

b. Y variable: Transit system vehicle miles, in
millions

c. 11 observations, 1 per transit system

IV. How can we best summarize this batch of paired observations?

1.  What  can  we  learn  from  looking  at  the  (X,Y)  scatterplot?  (8) 

2.  Do  the  data  have  a  linear  point  cloud? 

3.  Or  does  the  point  cloud  have  a  peculiar  shape? 

4. How do we effectively summarize linear point clouds?

5. Can we transform nonlinear point clouds to make them
more linear, and hence more easily summarized?

V. What is a batch of time series data?

Example: Total expenses for Community Hospitals (9)

a. X time variable: Year, 1950, 1955, 1960-1972

b. Y variable: Costs (in million $)

c. 15 time points

VI. How can we better understand this time series?

1.  What  curve  is  traced  by  the  time  plot? 

2.  What  curve  remains  after  the  data  have  been  smoothed? 

3. Can we extrapolate beyond the current range? What will (10)
expenses  look  like  in  1975?  1980?    What  were  they  in  1940? 

4.  Can  we  interpolate  between  two  consecutive  data  points? 
What  were  expenses  in  1953?  1959? 

5.  Are  there  any  periodicities  in  the  data  set? 

Conclusion: We need specific tools to use in analyses of each
of these three data forms.

Topic 3. Introduction to the Notation of Unit 3
I. Ordered Batches

1. Capital letter ("Y") denotes data set

2. First subscript ($Y_i$) denotes specific batch

3. $X_i$ denotes the value on the quantitative scale associated
with batch i

4. Second subscript ($Y_{ij}$) denotes specific observation in
a specific batch

II. (X,Y) paired observation

1. Capital letters (X and Y) denote each batch. Pairing of
batches is an underlying concept of multiple regression,
in which one dependent variable (Y) is explained by (paired
with) several independent variables (X's)

2. A specific ordered pair is denoted by $(X_i, Y_i)$.

III. Time series data

1. Same notation as paired observations

Lecture 3-1. Analysis of Ordered Batches

Analysis of Ordered Batches: The perception, display, and summarization
of a collection of ordered batches

Lecture Content: (1)

1. Discuss the techniques for displaying two or more batches
simultaneously

2. Introduce new measures for the summarization of the relationship
between the multiple batch and the ordered scale

Main Topics:

1. Display of several batches ordered on some scale

2. Introduction of "conditional typical values" to summarize the
batches

3. Discussion of the effectiveness of conditional typical values
in summarization
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Topic  1.    Display  of  several  batches  ordered  on  some  scale 

I. Basic Issue: Comparison of ordered multiple batches, using
the natural scale

1. We know how to compare and transform unordered multiple
batches

2. We are interested in analyzing ordered batches in an
effective, consistent, and reliable manner

3. We need techniques to examine the batches, using the scale
associated with the collection


II.     Problem:     Can  we  simply  use  the  comparison  tools  of  Unit  2  for 
unordered  batches? 

1.  Specific  questions  to  be  answered  are  similar  to  those  for 
unordered  batches 

2. What do we do about the ordered nature of the batches?

3. As usual, a condensation of the information in the batches
should  follow  from  an  organization  of  the  collection 

4.  We  organize  the  batches  as  in  Unit  2,  but  our  condensation 
utilizes  the  natural  scale 


III.     Solution:     Organize  Parallel  Schematic  Plots  of  the  batches 
with  positioning  determined  by  the  scale 


IV.  Method 

1. We familiarize ourselves again with the definition of an
ordered multiple batch: a collection of two or more
batches that are related in a quantitative way

2. We look at some hypothetical examples: life expectancies
for countries classified by per capita income; number of
vehicles per transit system, classified by population
served

3. Here is a real example that we shall examine:

a. Number of Live Births, classified by the age (2a)
of the mother at the time of the birth

b. Batch observations are various years, 1950-1967 (2b)

4. We plot the observations on an (X,Y) plane (3)

a. $X_{ij}$ = Value on scale for batch i

b. $Y_{ij}$ = Observation j in batch i

c. $X_{ij}$ is constant over all j

d. Scatterplot for Live Birth data

i. $X_{ij}$ not well defined: given as range; e.g., 25-29
years

ii. Let $X_{ij}$ be the midpoint of each interval; e.g.,
25-29 years interval has $X_{ij}$ = 27

iii. X for Over 45 and Under 15? Arbitrary; use 47
and 13

5.  Next  draw  a  schematic  plot  for  each  batch--centered  at  (4) 
the  correct  X  for  each  batch 

6.  Width  of  box  =  width  of  interval  associated  with  the 
corresponding  X 

7. Thus we have organized each batch, using the position of
each batch on the X scale

8. "Ordered" Parallel Schematics with CMU-DAP

a. Unfortunately the plots cannot be positioned properly

b. Treat each batch separately, and cut and paste each
schematic on a piece of graph paper, in the proper
place

9. Plotting the raw data with CMU-DAP

a. Create an $X_{ij}$ data file, constant for a given i, to
PLOT against the $Y_{ij}$ multiple batch values.
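The assignment of a scale value to each batch (step 4.d above) can be sketched in Python. This is an illustration only: apart from the 25-29 interval and the arbitrary values 13 and 47 mentioned in the outline, the age intervals listed here are assumed for the example, and the code is unrelated to CMU-DAP.

```python
def scale_value(lower, upper):
    """Midpoint of a closed age interval, e.g. 25-29 years gives 27."""
    return (lower + upper) / 2


# Hypothetical five-year age groups; only 25-29 is taken from the outline.
intervals = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]

# Open-ended groups have no midpoint, so arbitrary values are assigned.
x_values = {"Under 15": 13, "Over 45": 47}
for lower, upper in intervals:
    x_values[f"{lower}-{upper}"] = scale_value(lower, upper)

for label, x in sorted(x_values.items(), key=lambda item: item[1]):
    print(f"{label:>9}  X = {x}")
```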
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Topic  2.    Conditional  Typical  Values  to  summarize  the  batches 


I. Basic Issue: Once organized into parallel schematics, how
can we summarize each batch

1. The "pattern" of the schematic display is very important
in the analysis

a. Do the plots increase? If so, is the increase roughly
linear, or is the functional relation of higher degree

b. Do the plots decrease? Again, what is the functional
form of the decrease?

2. We want to pick one value from each batch to study further
the pattern of the batches


II. Problem: What value do we use for our summarization?

1. The value should be representative

2. If the spread of each batch were zero, we would have no
problem in choosing a set of typical values

III. Solution: Use medians, our good friend!


1. The typical value for $Y_{ij}$ depends on the batch $X_{ij}$ value

2. We compute typical values of $Y_{ij}$, "conditional" on being
located in batch i: "conditional typicals"

3. Conditional typical value of $Y_{ij}$, given scale value $X_i$ =
Median of batch i = median($Y_{i1}, Y_{i2}, \ldots, Y_{in_i}$) where
$n_i$ = # observations in batch i.

4. For our live births example, here are the conditional (5)
typicals

5. We can locate each conditional typical within each batch
on the (X,Y) scatterplot, and connect them

6. We study the form of the line segments on this connected (6)
plot

7. Hinges also help in our study: we can locate the hinges, (7)
and connect them

8. Specific question: Do the line segments connecting the
conditional typicals form a line?

9. Secondary question: Are the spreads of the batches
constant?

10. In a later lecture, we transform both X and Y to

a. Promote linearity of the conditional typicals

b. Equalize spread within the batches

11. Conditional Typicals constructed with CMU-DAP

a. Merely use SUMMARY to find medians, and draw them in
on your scatterplot
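The computation of conditional typicals as batch medians can be sketched in Python. The batches below are made-up numbers keyed by their scale value X, not the live birth data, and the standard library `statistics.median` stands in for the CMU-DAP SUMMARY step.

```python
from statistics import median

# Hypothetical ordered batches: scale value X_i -> observations Y_i1, ..., Y_in
batches = {
    17: [3, 5, 4, 6],
    22: [10, 12, 11],
    27: [9, 8, 10, 7, 9],
}

# Conditional typical for batch i = median of the observations in batch i
conditional_typicals = {x: median(ys) for x, ys in batches.items()}

# Points (X_i, typical_i), in scale order, ready to be connected on the scatterplot
connected = sorted(conditional_typicals.items())
print(connected)
```

Connecting these points in order of X gives the line segments whose shape is studied in step 6.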
Topic 3. Effectiveness of Conditional Typical Values in Summarization

I. Basic Issue: Assessing how well conditional typicals describe
the data set

1. The breaking up of data into Fit + Residual has been
discussed

2. For ordered batches: $Y_{ij}$ data value = Conditional Typical
for batch i + Residual

3. Fit = Conditional Typical for batch i

4. How much is left after we subtract the fit from each data
value?
II.    Problem:    How  do  we  analyze  the  batch  of  Residuals  from  the 
fit 

1.  Residuals  should  not  be  large  relative  to  the  fit 

2.  The  batch  of  residuals  should  be 

a.  Symmetric 

b.  No obvious outliers

c.  Close  to  well-behaved 


III.     Solution:    Analyze  the  residuals  as  a  single  batch  using  the 
tools  of  Unit  1. 
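
As a rough sketch of this step, the residuals can be formed by subtracting each batch's conditional typical from its data values and pooling the results into one batch. The data below are hypothetical and the function name is an assumption made for illustration.

```python
from statistics import median

def residuals_from_typicals(batches):
    """Pool (data value - batch median) over all batches into one sorted batch."""
    pooled = []
    for ys in batches.values():
        fit = median(ys)                  # fit = conditional typical for the batch
        pooled.extend(y - fit for y in ys)
    return sorted(pooled)

# Hypothetical ordered batches (not the live birth data)
batches = {
    20: [310, 295, 330, 305],
    25: [410, 455, 430],
    30: [280, 300, 260, 275, 290],
}

res = residuals_from_typicals(batches)
print(res)
print("median of residuals:", median(res))
```

The pooled list can then be examined with the single-batch tools (stem-and-leaf, number summary, schematic plot) for symmetry and outliers.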


IV.  Methods 

1.    Back to our example: residuals from conditional typicals
for live birth data  (8)

2.     Stem-and-Leaf Display of Residuals.     Note large number
of zeros, and a few outliers  (9)

3.  Schematic plot and number summary very helpful: note
symmetry and outliers  (10)

4.  Another example:    Average net interest costs, in
percent, for bond sales for public schools.     Entries are
for years 1964-1974  (11)

5.  Find conditional typicals, plot the values, and find
residuals  (12) (13)
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Beginning  1  Lecture  3-1  Outline 

Topic 1
Section IV

      2b  ... mother, 1950-67

4.    3   Plot of live birth data

5.    4   Parallel schematic plot of live birth data

Topic 2
Section III

4.    5   Conditional typical values for live births

6.    6   Conditional typical values connected

7.    7   Hinges and conditional typicals connected

Topic 3
Section IV

1.    8   Residuals from fits for live ...

          ... bond interest rates

5.    13  Conditional typicals connected


[Transparencies for Lecture 3-1: Number of live births by age of mother (in years), 1950-67: data, plots, conditional typicals, and residuals; Net interest costs (%) for public school bond sales, 1964-1974, with conditional typical values]


Lecture 3-2.    Looking at (X,Y) data

Looking at (X,Y) data:    Analysis by reorganization of (X,Y) paired
observation data

Lecture Content:  (1)

1.  Discussion of how (X,Y) paired observation data may be viewed
as an ordered multiple batch

2.  Summarization of the ordered batch representation of the (X,Y)
data set by fitting a line to the conditional typical values

Main Topics:

1.  Viewing an (X,Y) data set as an ordered collection of
"mini-batches"

2.  Fitting a line to the conditional typical values by using
three mini-batches


Tool Introduced:

Resistant Line
Topic 1.    Viewing an (X,Y) data set as an ordered collection of
"mini-batches"


I.    Basic  Issue:    Consideration  of  a  data  set  of  paired  observations 
as  an  ordered  multiple  batch 

1.  We know the characteristics of an (X,Y) paired observation
data set:     two batches of which the ith observation of
one is related to the ith observation of the other

2.  We have already presented various examples of these data
sets:  IQ scores of twins; scores on the pretest and the
final exam for each member of the class

3.  We now have a good feel for ordered multiple batches and
the summarization of the batches with conditional typical
values

4.  Can we analyze an (X,Y) data set as an ordered multiple
batch and thus condense it by the use of conditional
typicals?

5.  Example:     Percent illiterate in the population, by state,
in 1930 (X) and 1960 (Y)  (2)

We shall use this data set in future discussions


II.    Problem:    How  do  we  break  up  an  (X,Y)  data  set  into  multiple 
batches? 

1.  We use the X variable as the ordered multiple batch scale

2.  The number of batches is of course arbitrary; it depends on
the number of observations, n, in the data set

3.  As limiting cases:

a.  Use n mini-batches:     1 batch per X (or distinct X)
value

b.  Use only 1 batch:  Y becomes a single batch of numbers

4.    We choose the number of batches so that the corresponding
intervals on the X axis are:

a.    Bounded by integers

b.  Approximately equal width (if possible)

c.  Containing equal numbers of Y values


5.    A scatterplot of the (X,Y) data is always the first step
in the analysis

a.  The plot helps to determine where to break up the
X axis

b.  Here is the scatterplot for our illiteracy data:
note linear pattern (important)  (3)


III.    Solution:    The number of "mini-batches" to use is arbitrary,
and their location along the X axis should be determined by
a scatterplot of the observations
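
A minimal sketch of breaking an (X,Y) data set into mini-batches follows. The cut points 2, 4, 6, 10 are the ones used for the illiteracy data in the Method section; the (X,Y) pairs are hypothetical, and sending a boundary value to the upper batch is an assumption, since the outline does not say how ties at a cut point are handled.

```python
from bisect import bisect_right
from statistics import median

def mini_batches(pairs, cuts):
    """Group Y values by the X interval (defined by sorted cut points) of each pair."""
    batches = [[] for _ in range(len(cuts) + 1)]
    for x, y in pairs:
        # bisect_right sends an X equal to a cut point into the upper batch
        batches[bisect_right(cuts, x)].append(y)
    return batches

cuts = [2, 4, 6, 10]  # percent illiterate in 1930 (X)
pairs = [  # hypothetical (X, Y) values, not the state data
    (0.8, 0.6), (1.5, 1.0), (1.9, 1.3), (2.6, 1.5), (3.9, 1.9), (4.4, 2.1),
    (5.1, 2.5), (7.2, 3.0), (9.0, 3.9), (13.5, 4.2), (14.9, 4.8),
]

batches = mini_batches(pairs, cuts)
for number, ys in enumerate(batches, start=1):
    print("Batch", number, ys, "conditional typical:", median(ys))
```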


IV.    Method:    Using the Illiteracy data

1.    Here are the mini-batches

a.  If $X_i$ is less than 2%, $Y_i$ is in batch 1

b.  If $X_i$ is between 2% and 4%, $Y_i$ is in batch 2

c.  If $X_i$ is between 4% and 6%, $Y_i$ is in batch 3

d.  If $X_i$ is between 6% and 10%, $Y_i$ is in batch 4

e.  If $X_i$ is greater than 10%, $Y_i$ is in batch 5

2.  Thus we have 5 batches: 3 of equal width 2%, 1 of width 4%,
1 of width 10.5%

3.  The inequality in width was forced by the clustering of
the data points at the left end of the plot

4.  Here is the data set arranged into our mini-batches.
Batch observations are the 1960 % illiterate (Y) values  (4)

5.  We merely analyze this rearranged data set as an ordered
multiple batch

a.  Parallel Stem-and-Leaf shows increasing pattern,
few outliers  (5)

b.  Parallel schematics drawn so that width of box =
width of interval.    Spreads increase  (6)

c.  Compute conditional typical values  (4)

Batch 1    0.9%
Batch 2    1.6%
Batch 3    2.2%
Batch 4    3.45%
Batch 5    4.35%

d.  Plot these values and the hinges, connected, on a
separate plot.    Very linear, eyeball slope ≈ .3  (7)

6.  In conclusion, the connected conditional typical plot
is very informative

7.  However, if this plot is linear, we would formally like to
fit a line as a final summarization
Topic 2.    Fitting a line to the conditional typical values

I.    Basic Issue:    Formalization of the analysis of (X,Y) paired
observational data by fitting a line

1.  We want to know exactly how Y varies with X; i.e., if
Y = f(X), what is f?

2.  We hope that f(X) = a + bX, a line

3.    If f is not a line, perhaps we can transform X and/or
Y to make it so.  We discuss these transformations in the
next lecture

4.  Note that Y is a function of X.    In some cases this is
obviously so.    But X could also be a function of Y!

5.  In some ways, which variable to use as the dependent
variable (which variable is a function of the other)
is arbitrary


II.    Problem:    How do we find the a and b in the equation Y = a + bX?

1.  We would like to use the conditional typical values in
the fitting process

2.  How many mini-batches do we use?

3.  Which two points in the connected conditional typical
plot do we use to draw the line?


III.     Solution:    Use three mini-batches of roughly equal size and
connect the first and last conditional typicals


IV.    Method:    Resistant Line
1.  We use the Resistant Line, due to Tukey.  It
is a fitting procedure that is resistant to outliers in
the data

2.  Procedure:    applied to Illiteracy Data

a.    Break the data into thirds according to the X (% illiterate
in 1930) values: an easy rule to apply to find endpoints
of our 3 intervals.    If the number of observations is
not divisible by 3, put the extra 1 or 2 in the middle
mini-batch.    That is not necessary in this case.  (8)

b.    Find Median X and Median Y in each third
Median X ≈ midpoint of interval
Median Y = conditional typical of the batch

c.    Label these three median pairs
$(X_{(1)}, Y_{(1)})$, $(X_{(2)}, Y_{(2)})$, $(X_{(3)}, Y_{(3)})$.
Note that it is unlikely that the $(X_{(i)}, Y_{(i)})$ pairs
will actually be paired together in the original data.

d.    Locate these three points on the scatterplot, and
connect $(X_{(1)}, Y_{(1)})$ and $(X_{(3)}, Y_{(3)})$.    This is the
fitted line  (9)

e.    Formally calculate

$$b = \frac{Y_{(3)} - Y_{(1)}}{X_{(3)} - X_{(1)}}$$

$$a = \text{Median}\,(Y_{(i)} - bX_{(i)}) = \text{Median}\,(0.43,\ 0.63,\ 0.43) = 0.43$$

f.  Examine fitted line on the scatterplot.    Note how
well it fits (except Alaska)  (10)

g.  Line may need to be "polished" or adjusted slightly
for a better fit

3.  To determine how well the line fits the data, we calculate
residuals:  (11)

$$r_i = Y_i - a - bX_i$$

This batch of residuals is extremely important in
assessing the fit.     Treated as a single batch, residuals
should be symmetric about 0, with no outliers.     In other
words, well behaved, mean 0, standard
deviation indeterminate.  (12)


Line constructed with CMU-DAP

Use function LINE.  Options to save fitted values and
residuals.
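
The procedure above (thirds by X, medians in each third, slope from the outer thirds, intercept as the median of the three candidate intercepts) can be sketched as follows. This is an illustrative reading of the outline, not the CMU-DAP LINE function; the data are hypothetical and the name `resistant_line` is an assumption.

```python
from statistics import median

def resistant_line(xs, ys):
    """Three-group resistant line: returns (a, b) for Y = a + bX."""
    pairs = sorted(zip(xs, ys))           # order the pairs by X
    n = len(pairs)
    k = n // 3                            # size of the two outer thirds
    groups = [pairs[:k], pairs[k:n - k], pairs[n - k:]]  # extras go to the middle
    med_x = [median(x for x, _ in g) for g in groups]
    med_y = [median(y for _, y in g) for g in groups]
    b = (med_y[2] - med_y[0]) / (med_x[2] - med_x[0])
    a = median(my - b * mx for mx, my in zip(med_x, med_y))
    return a, b

# Hypothetical data on the line Y = 1 + 2X, with one wild value at X = 5
xs = list(range(1, 10))
ys = [1 + 2 * x for x in xs]
ys[4] = 100

a, b = resistant_line(xs, ys)
residuals = [y - a - b * x for x, y in zip(xs, ys)]
print(a, b)
print(residuals)
```

Because only medians enter the fit, the wild value leaves a and b unchanged and shows up as a single large residual.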
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Lecture  3-2 
Transparency Presentation Guide


Lecture Outline      Transparency
Location             Number          Transparency Description

Beginning            1               Lecture 3-2 Outline

Section I            2               Illiteracy Data, per State,
                                     1930 and 1960

Section II
5.b                  3               % Illiterate 1930 vs. %
                                     Illiterate 1960

Section IV
4.                   4               % Illiterate 1960 classified into
                                     Mini-batches

5.a                  5               Stem-and-Leaf displays of Illiteracy
                                     Mini-batches

5.b                  6               Schematic Plot of Illiteracy
                                     Mini-batches

5.d                  7               Connected Conditional Typicals for
                                     1960 Illiteracy Data

Topic 2
Section IV
2.a                  8               Illiteracy Data, broken up into
                                     thirds

2.d                  9               Thirds of Illiteracy data and
                                     connected conditional typicals

2.f                  10              % Illiterate 1960 vs % Illiterate
                                     1930

3.                   11              Illiteracy Data, Residuals

5                    12              Residuals, Stem-and-Leaf
Lecture 3-3.    Summarizing Scatterplots

Summarizing Scatterplots of data:    Transforming (X,Y) data
sets to improve the linear fit, and fitting lines by least squares.

Lecture Content:

1.  Transformations of (X,Y) data sets

2.  Least Squares Principle and coefficient estimates

3.  Assessing the fit

Main Topics:

1.  Transformations to improve linearity and equalize spread

2.  Fitting a line using least squares

3.  Looking for patterns in the residuals

Tools Introduced:

1.  Least Squares

2.  Residual Plots


Topic 1.    Transformations to Improve Linearity and Equalize Spread


I.    Basic Issue:    Transforming data to make a fitted line a good
summary

1.  Paired observational data are rarely linear in the raw
form

2.  In addition to this nonlinearity, the spreads of the
constructed mini-batches may not be equal

3.  We seek to transform the data to:

a.  Improve Linearity

b.  Equalize Spread

4.    Both these goals are important, and should be sought
whenever possible and necessary


II.    Problem:    How do we achieve these 2 goals?

1.  Linearity is (usually) increased by transforming
the X variable

a.  Transforming X to higher powers has the effect of
stretching the X axis, which promotes linearity in
plots that resemble exponential functions ($e^{x}$) or ($-e^{x}$)

b.  Transforming X to small powers has the effect of
shrinking the X axis, which promotes linearity in plots
that resemble negative exponential functions ($e^{-x}$)
or ($-e^{-x}$)

2.  Spread is often equalized by transforming Y; similar to
transformations to equalize spread with multiple batches

3.  Our conditional typical values should be useful in choosing
good transformations, since the plot of the values "mimics"
the patterns of the (X,Y) scatterplot


III.    Solution:  Use (median X, median Y) points from the three thirds
for resistant lines

1.  We divide the data into thirds on the basis of the X-values,
keeping each Y with its paired X-value

2.  We then have 3 sample points
$(X_{(1)}, Y_{(1)})$, $(X_{(2)}, Y_{(2)})$, $(X_{(3)}, Y_{(3)})$
which are the medians of the 3 mini-batches
3.    The effect of transformations on the data set can be seen
by merely transforming the 3 sample points

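4.    Seek a transformation so that the slopes

$$b_1 = \frac{Y_{(2)} - Y_{(1)}}{X_{(2)} - X_{(1)}}, \qquad b_2 = \frac{Y_{(3)} - Y_{(2)}}{X_{(3)} - X_{(2)}}$$

computed from the transformed sample points, are equal; that is,
$b_1 / b_2 = 1$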

IV.    Method


1.  Example:    Per capita Income (X) and Infant Mortality (Y)
for nations  (2) (2a)

2.  Scatterplot shows both nonlinearity (curve has $e^{-x}$ shape)
and disparity in spread  (3)

3.    Mini-batches of data are constructed

4.  Mini-batches are plotted via Parallel Schematic Display;
discrepancies from ideal situation are evident

5.  Connecting the Conditional Typicals and Hinges is quite
useful in studying the relationship of the raw data  (5)

6.  Recall Tukey's diagram for determining which direction
to move in our transformations.    Transforming Y can also
help improve linearity

7.  However, we first concentrate on transforming X for
linearity.    If necessary, we then transform Y to equalize
spread and possibly promote increased linearity

8.    We transform the three resistant line summary
points, and examine the line connecting the first and
second, and the line connecting the second and third

9.    When the slopes of these lines are equal, we have the
appropriate transformation of X and Y

10.    The calculations for our example:     log, log appears best  (6)

11.    Scatterplot of log (infant mortality) vs log (income)
is very linear

12.  Fitted resistant line:    Y = 3.33 - .59X  (8)

13.  Residuals are nice and tight around zero, except for 2
large values (Libya and Saudi Arabia)  (8)

14.  In conclusion, we study the effect, on the schematic plots
of the mini-batches, of the log X and log Y transformations

a.  log(X) has shrunk the X scale, and increased linearity  (9)

b.  log(Y) has definitely equalized spread  (10)

c.  Put them both together, and plot looks very good  (11)
Topic 2.    Fitting a line using least squares

I.    Basic  Issue:    Presentation  of  the  Least  Squares  Principle 


1.    Resistant line is just one method of fitting a line to
an (X,Y) point cloud

2.  We prefer it because it is resistant to outlying or
deviant points

3.  The "classical" fitting procedure is known as "least
squares"

4.  In recent years least squares has come under attack
because it is very sensitive to outliers


II.    Method


1.  The least squares principle finds the line which has the
minimum value of the quantity:  (12)

$$\sum_{i=1}^{n} (Y_i - a - bX_i)^2$$

2.  This line, $\hat{Y}_i = \hat{a} + \hat{b}X_i$, minimizes the sum of the squared
residuals

3.  Since the residuals are used in the procedure, they are
very important in assessing how well the line fits

4.  Here is the geometrical interpretation of least squares.
Note that we minimize the squared distances of the points
from the line  (13)

5.  The least squares line will be a good fit when:

a.  Data are linearly related

b.  Spread about the line is constant

c.  No outliers

6.  How do we assess the fit?  (14)

a.  Examine the residuals: stem-and-leaf, plot vs X
(see Topic 3)

b.  Examine variance about the line:

$$s^2_{y|x} = \frac{\sum (Y_i - \hat{a} - \hat{b}X_i)^2}{n-2}$$

We want this as small as possible

c.  Examine one minus the ratio of residual variation
to the total variation of Y:

$$r^2 = 1 - \frac{\sum (Y_i - \hat{a} - \hat{b}X_i)^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{(n-2)\, s^2_{y|x}}{\sum (Y_i - \bar{Y})^2}$$

This is the "percent of variance" explained.

7.  Least Squares line for Infant Mortality data:  (15)

Y = 3.11 - .512X
Slope differs from resistant slope

8.  Residuals slightly more tight around 0 than with resistant
line  (16)

Least  Squares  with  CMU-DAP 
Use  function  ^filEG. 
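
For readers without CMU-DAP, the least squares quantities in this section (the coefficients, the variance about the line, and the percent of variance explained) can be sketched in plain Python. The data are hypothetical and the function name is an assumption for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b, s2, r2, residuals) for the least squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                           # slope
    a = y_bar - b * x_bar                   # intercept
    residuals = [y - a - b * x for x, y in zip(xs, ys)]
    ss_resid = sum(r * r for r in residuals)
    s2 = ss_resid / (n - 2)                 # variance about the line
    r2 = 1 - ss_resid / sum((y - y_bar) ** 2 for y in ys)
    return a, b, s2, r2, residuals

# Hypothetical data
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]

a, b, s2, r2, residuals = least_squares(xs, ys)
print(a, b, s2, r2)
```

The returned residuals are the batch to examine with a stem-and-leaf display and a plot against X, as described in Topic 3.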
Topic 3.    Looking for Patterns in the Residuals

I.    Basic Issue:    What should the batch of residuals resemble?

1.  Batch of residuals should be:

a.    Symmetric about zero
b.    Devoid of outliers

2.  That is:    batch should be well-behaved

3.  Plotted against X, residuals should be a random swarm of
points, with no pattern

II.    Method:    Residual Plots

1.  Plot of residuals vs X for Infant Mortality data: no
pattern evident, two high outliers are apparent  (17)

2.  Patterns to look out for

a.  Trigonometric (Sinusoidal)

b.  Sign patterns

c.  Wedge shape

d.  Linear

e.  Curves

f.  Deviants
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Lecture  3-3 
Transparency Presentation Guide


Lecture Outline      Transparency
Location             Number          Transparency Description

                     1               Lecture 3-3 Outline

                     2, 2a           Income and Infant Mortality
                                     Rate for Nations

                     3               Scatterplot of Income and
                                     Infant Mortality

                     4               Schematic plots of Infant Mortality

                     5               Conditional Typicals for Infant
                                     Mortality data

                     6               Determination of Transformation
                                     for Infant Mortalities

                     7               Scatterplot of Logged Infant
                                     Mortality data, with Fitted Line

                     8               Stem-and-Leaf of Residuals

                     9               Schematic Plot: X transformed

                     10              Schematic Plot: Y transformed

                     11              Schematic Plot: X and Y transformed

                     12              Least Squares principles

                     13              Geometrical Least Squares

                     14              How well does the Least Squares
                                     Line fit

7.                   15              Least Squares Line of Infant
                                     Mortality

8.                   16              Residuals from Least Squares
                                     Lines

Topic 3
Section II
1.                   17              Residual Plot for Infant
                                     Mortality Data
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Lecture  3-4.    Analysis  of  Time  Series  Data 


Analysis of Time Series Data:    Smoothing time series data with little
structure, and studying and summarizing time series data with substantial
structure


Lecture Content:  (1)

1.  Smoothing time plots to remove irregularities, and identifying
any periodicities in the data

2.  Fitting lines to time plots, and extrapolating and interpolating
apparent trends


Main Topics:

1.  Smoothing Time Plots

2.  Summarizing Time Plots


Tools Introduced:


Running Medians of 3 Smoother

Topic 1.    Smoothing Time Plots

I.    Basic Issue:    Time Series Data may have quite a bit of "noise"

1.    Time series data consist of paired data, where the X
value is a time scale: days, months, years, etc.

2.    We generally have only one Y observation for each X
value

3.  Such data sets can be quite irregular, having many peaks
and troughs, when plotted

4.  We need to be able to find the pattern of the data (if
present) by filtering out the irregularities, or "noise"


II.    Problem:    How do we best identify any patterns in the data?

1.  Time series data are commonly collected.    We can think of
many examples:    U.S. Gross National Product for the years
1946-1976; daily reported number of swine flu cases, January-
September 1976; Dow Jones averages in a 30 day period

2.  We need techniques applicable to all these instances

3.  We would like to average a time series data set to remove
noise

4.  There are 2 distinct methods of averaging

a.  Monthly Averages

b.  Running Averages

5.  Monthly Averages occur when data are collected daily and
then are summed or averaged so that only one data value
is reported for each month

6.  Similarly, we can average monthly data to get yearly data,
yearly data to obtain decade data, etc.

7.  Such averaging is quite helpful and often used; however,
there is a great reduction in number of observations
(30 → 1, 12 → 1, etc.)

8.  We prefer the use of running averages, since such loss
does not occur


III.    Solution:    "Smooth" the data by taking running medians of
three


1.    Smoothing has become quite popular in the last 10 years
because it is easy to do and is effective

2.    We have chosen a very simple smoother, running medians
of 3, which works well even though it is simple

IV.  Method 

1.  Example:    Emergency Room registrations at D.C. General
Hospital, 1970-1975.    Monthly data  (2)

2.  The first step is always to make a time plot of the data  (3)

a.    Note the many peaks and troughs

b.    Data appear to increase until 1973, then fall

c.    Very difficult to compare years because of
irregularities

3.    We shall smooth the data to remove these

a.  Write down the data in one column on the left
margin of a page  (4a) (4b)

b.  Take 3 values consecutively and record their median;
for the second data value 8120, record med(7476, 8120,
7706) = 7706

c.  Continue through the data, taking 3 at a time

d.  Endpoints?    Merely copy the end value.    Tukey has other
suggestions

e.  Continue the smoothing until the ith smooth is identical
to the (i-1)st smooth.    These data required 3
smooths

f.    From one smooth to the next, we need only record
those values that change

g.    Plot the smoothed data, and study it

i.    Hospital data similar from year to year

ii.    Rises, peaks in summer, then falls

iii.     1973 distinctly higher; 1971, 1975 distinctly
lower  (5)

4.  Is this the whole story?    Suppose we consider the total
number of registrations, and divide to obtain % of all
registrations that are emergencies  (6)

a.    These data are more similar, and their plot has less
pattern  (7)

b.    We smooth these percentages and plot them  (8)

i.    Shape is similar to before: peak in summer, low
in winter

ii.    Hence conclusions are similar to conclusions from
raw data

5.  Note that these data had little trend or linear increase;
we could not fit a line to them.    In the next section, we
analyze data with more pattern

6.  Smoothing with CMU-DAP:
Use function SMOOTH
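
The hand procedure in step 3 can be sketched as follows: take running medians of 3, copy the end values, and repeat until a pass changes nothing. Only the first three values (7476, 8120, 7706) come from the outline; the rest of the series is hypothetical, and the function names are assumptions, not the CMU-DAP SMOOTH function.

```python
from statistics import median

def smooth3(ys):
    """One pass of running medians of 3; end values are copied unchanged."""
    if len(ys) < 3:
        return list(ys)
    middle = [median(ys[i - 1:i + 2]) for i in range(1, len(ys) - 1)]
    return [ys[0]] + middle + [ys[-1]]

def smooth_until_stable(ys):
    """Repeat smooth3 until nothing changes; return (smooth, passes that changed values)."""
    current, passes = list(ys), 0
    while True:
        nxt = smooth3(current)
        if nxt == current:
            return current, passes
        current, passes = nxt, passes + 1

# First three values are from the outline; the remainder are hypothetical
series = [7476, 8120, 7706, 7900, 7500, 8300, 8000]

smoothed, passes = smooth_until_stable(series)
print(smoothed, passes)
```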
Topic 2.    Summarizing Time Plots

I.    Basic Issue:    Describing the trends in time series data

1.  After smoothing the data, if there is evidence of time
trends, we should transform the data (if necessary) and
fit a line

2.  Specific issues are

a.  Extrapolation:    Can we say anything about the data
beyond the range that we have?

b.  Interpolation:    Can we estimate a Y value for a time
point lying between 2 time points for which we have
data?

3.  Data with substantial structure are much easier to
extrapolate and interpolate than data that are mostly noise

4.  Are there any monthly, seasonal, etc. trends?    These are
called periodicities, and if present, should be noted

5.  Finally, how does $(X_t, Y_t)$ relate to $(X_{t-1}, Y_{t-1})$?

II.    Problem:    Are there any problems unique to time series data?

1.    With only one Y value for every X, and equally spaced X's,
trends are much more evident than with ordinary (X,Y)
data

2.    The study of periodicities and extrapolation and
interpolation presents no difficulties; however, one
must use caution, because drawing conclusions from a
data set is a "delicate" matter


III.    Methods


1.  We study another example:    per capita expenditures for
household electricity in the U.S., 1929-1972.    Data are
in hundred $/person  (9)

2.  Data reveal an exponential trend

3.  Take log (Y) and find a reasonably linear trend  (10)

4.  Fitted line has equation:  (11)

log Y = 0.06 X - 7.29

5.  Note cyclical pattern of residuals from line: this pattern
is an example of a 20 year period:    high in 1935, low
in 1945, high in 1955, low in 1965

6.  Residuals as a schematic plot look fine  (12)

7.  However, plotted against X, the periodicity is revealed!  (13)

8.  Extrapolation is reasonably easy, because the line is an
adequate summary  (11)

9.  Interpolation is also straightforward, but remember
periodicities!

10.  A further example:  Number of physicians in the United
States, 1850-1973  (14)

a.  Data on U.S. population reveal that increase has not
been constant with the population

b.  Suppose we plot $(X_t, Y_t)$ as a function of $(X_{t-1}, Y_{t-1})$:
we let our Y variable be $Y_t$, and our X
variable $Y_{t-1}$

c.  Plot is amazingly linear!  (15)

i.     Y(t-1) can be used to make good predictions of Y(t)

ii.    The regression of Y(t) on Y(t-1) is a "lagged
regression", and the knowledge that this regression
is good is quite useful
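
A lagged regression of this kind can be sketched by pairing each value with its predecessor and fitting a line to the pairs. The series below is hypothetical (it is not the physician data), least squares is used for the fit as an assumption, and the function names are illustrative.

```python
def lagged_pairs(series):
    """Pair each value Y(t) with its predecessor: returns [(Y(t-1), Y(t)), ...]."""
    return list(zip(series[:-1], series[1:]))

def fit_line(pairs):
    """Least squares fit of Y = a + bX to (X, Y) pairs; returns (a, b)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical series that doubles at every time step
series = [1, 2, 4, 8, 16]

pairs = lagged_pairs(series)
a, b = fit_line(pairs)
prediction = a + b * series[-1]   # predict the next value from the latest one
print(pairs, a, b, prediction)
```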
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Homework ,  Unit  3 


1.    The following data were collected for a study of expected,
desired, and ideal family size.    The data below show the total number
of births expected by married women 18-39 years old by age group and
race for the years 1965-1972.


(a)    Considering each age group as a separate batch, construct
parallel stem-and-leaf displays for each of the four batches.
Calculate the five number summaries for each batch and display
these under the corresponding stem-and-leaf display.
What patterns (if any) do you observe?

(b)    Draw parallel schematic plots for the four batches.    Do you
find any additional differences in expected births?

(c)  Redraw the parallel schematic plot and connect the medians,
hinges, and extreme values on this graph as done with the
ordered multiple batches in class.    What can you learn from
the plot?    Which graphic presentation ((a), (b), or (c))
makes these patterns (or lack of them) most obvious?

(d)  Look again at the raw data.     Suggest two other possible ways
in which the data might be examined.     (Hint:    Above we explored
a relationship between expected births and age of wives.
What other possibilities are suggested by the raw data?)

(e)  What do you conclude about the relationship between expected
births and a married woman's age?    What implications can you
draw from your analysis regarding the demand for elementary
school teachers in the next 16 years?

2.    The following data show the number of employees on non-agricultural
payrolls for the years 1951-1973 for the 11 states comprising the
"old south."

(a)    Consider each year as a batch.    Calculate the 5 number summary
for each batch.    You may wish to cut the data values to tens
of thousands.     If you order the data (cut or raw) what unusual
fact stands out?     (Hint:     identify the ordered data by state).
Why might this be so?

(b)    Draw a parallel schematic plot for the twelve batches.  What
trends (if any) do you observe?


(c)    How does the number of employees in non-agricultural jobs change
over time?    What implications does your analysis have for the
employment structure of the "old south"?
(1)    Total Number of Births Expected by Married Women 18-39 Years Old
by Age Group and Race, 1965 to 1972


Age Group (yrs.)     21     27     32     37
Year & Race

White
1972                2.2    2.4    2.8    3.2
1971                2.4    2.6    2.9    3.2
1970                2.6    2.7    3.0    3.2
1967                2.9    3.0    3.2    3.2
1965                3.1    3.3    3.5    3.3

Black
1972                2.4    2.8    3.7    4.0
1971                2.6    3.1    3.7    4.2
1970                2.9    3.2    3.8    4.1
1967                2.8    3.4    4.3    4.2
1965                3.4    4.0    4.4    4.1
(2)    Employees on non-agricultural payrolls by state for the "old south", in thousands


SUte/Year 

1951 

1953 

.  1955 

1957 

1959 

1961 

1963 

1965 

1967 

1969 

1971 

1973 

AlitMS 

662.8 

692.7 

702.9 

754.8 

764.4 

774.6 

812.5 

886.5 

951.8 

iodb.2 

1021.9 

1135.6 

Aricanssi 

319.0 

319.6 

321.0 

337.4 

359.4 

376.0 

414.9 

455.3 

497.9 

530.7 

549.2 

619.9 

Florida 

759.7 

848.8 

965.9 

1152.7 

1273.0 

1333.9 

1447.4 

1619.1 

1816.4 

2069.9 

2249.2 

2756.5 

Georgia 

872:3 

929.7 

959.5 

997.4 

1030.1 

1050.7 

1139.7 

1257.1 

1394.7 

1531.7 

1602.9 

1799.7 

Louisiana 

669.5 

711.4 

725.5 

602.6 

789.1 

780.6 

817.0 

905.5 

1005.0 

1041.0 

1064.3 

1172.9 

Misiiisippi 

333.7 

344.1 

354.0 

366.9 

397.2 

408.7 

443.7 

485.3 

531.9 

567.0 

593.5 

678. 5 

North  Carolina 

987.2 

1023.7 

1059.4 

1101.3 

1163.7 

1209.1 

1298.6 

1431.2 

1600.9 

1747.0 

1818.4 

2014.7 

South  Carolina 

5D5.6 

543.8 

533.0 

545.0 

566.8 

587.0 

630.6 

686.1 

754.4 

819.8 

662.6 

984.0 

Tennessee 

605.9 

852.6 

867.6 

886.8 

907.0 

934.0 

1002.5 

1108.5 

1218.8 

1309.8 

1356.8 

1534.6 

Texas 

2103.5 

2224.7 

2291.2 

2450.2 

2513.0 

2544.1 

2700.1 

2925.3 

3251.7 

3599.2 

3692.1 

4146.4 

Virginii 

869.4 

903.2 

912.0 

972.0 

1000.5 

1034.8 

1123.8 

1218.9 

1330.2 

1438.1 

1558.0 

1747.4 



3.  The following data show total school budgets for various towns in
the Pittsburgh area and the total number of pupils in the corresponding
school systems. (The Pittsburgh district is deleted since it is
considerably larger than any of the others.)

(a)  Considering the number of pupils as the X variable, make a
scatterplot of the data. What can you learn from the plot?

(b)  Group the X values into mini-batches. (One possible grouping, in
thousands of pupils, would be 0-1.9, 2.0-2.9, 3.0-3.9, 4.0-5.9, 6.0+.)
Draw a parallel schematic plot. What patterns (if any) do you observe?

(c)  Determine the conditional typical values. Plot these values
on a separate graph. What trends (if any) do you see? How
well do the conditional typicals summarize the information
in the mini-batches?

(d)  Calculate the residuals from the conditional typicals and analyze
as a single batch. Do the residuals indicate any "lack of fit"?

(e)  What relationship between school budget and number of pupils
does your analysis suggest?
Total School Budget and Total Number of Pupils
For Area School Districts

District                     Budget        Pupils

Allegheny Valley         $  4,866,760       2,415
Avonworth                   3,185,000       1,865
Babcock                     4,795,500       2,922
Baldwin-Whitehall          14,248,615       6,515
Bethel Park                15,001,891       8,776
Brentwood                   3,284,971       1,807
Carlynton                   5,288,313       2,910
Chartiers Valley           10,658,743       5,715
Churchill                   8,468,331       4,799
Clairton                    3,430,498       1,956
Cornell                     3,408,583       1,416
Deer Lakes                  4,937,707       3,082
Duquesne                    2,514,094       1,530
East Allegheny              6,310,000       3,479
Edgewood                    1,655,497         859
Elizabeth-Forward           7,360,243       5,071
Fox Chapel                 11,876,197       6,074
Gateway                    14,717,000       8,508
General Braddock            4,851,796       2,668
Hampton                     5,470,700       3,229
Highlands                   8,781,000       5,338
Keystone Oaks               8,146,801       4,930
McKeesport                 11,825,0?5       8,139
Montour                     7,436,496       4,269
Moon                        9,098,485       5,133
Mt. Lebanon                14,992,283       7,815
North Allegheny            13,308,965       7,665
North Hills                13,793,726       8,104
Northgate                   4,822,199       2,693
Penn Hills                 20,242,552      13,480
Plum                        8,997,000       6,020
Quaker Valley               4,984,400       2,406
Riverview                   3,382,934       1,986
Shaler                     14,636,498       9,205
South Allegheny             4,522,672       3,171
South Fayette               2,502,741       1,263
South Park                  3,932,238       2,382
Steel Valley                6,915,120       3,514
Sto-Rox                     5,185,295       2,999
Swissvale                   3,686,749       2,040
Turtle Creek                2,739,724       1,505
Upper St. Clair             9,862,247       5,349
West Allegheny              5,126,134       3,410
West Jefferson Hills        7,093,055       4,551
West Mifflin                9,706,972       5,808
Wilkinsburg                 6,496,542       3,570

Source: Pittsburgh Press, July 10, 1976.
4.  The following data show the violent crime index rate per 100,000
population for various American urban rapid transit systems in 1970.

(a)  Scatterplot the data (consider population the X variable).
What pattern (if any) do you see?

(b)  A common transformation for populations is log. Looking at the
raw data, do you think that this is an appropriate transformation?
Why or why not?

(c)  Scatterplot the transformed data. What pattern (if any) do you
see?

(d)  Fit a resistant line to the transformed data and plot it on the
scatterplot from (c). Is it a good fit?

(e)  Calculate, plot (vs X), and examine the residuals. What do they
tell you about the fitted line?

(f)  Polish the line once. Plot the polished line on the scatterplot
from (c). Does it appear to be a better fit than the line
fitted in (d)? Calculate, plot, and examine the residuals
from the polished line.

(g)  Compare this residual plot to that in (e). Which shows a
better fit?

(h)  What conclusions can you draw about the relationship between
violent crime rate on transit systems and city population?
Violent Crime Index Rate per 100,000 Population for
Various American City Rapid Transit Systems in 1970


City/System       Violent Crime Index Rate       Center City Pop.       Log Center City Pop.


Boston 

o2o,2l3 

D .  /  70 

1  f577 

>k  ftAQ 
3  .  OD7 

ife>roi>«> 

1  -QCQ 

1 ,4?2,914 

3 . 124 

ueKisna 

A 1 Deny 

280,032 

5 .447 

Atlente 

_  77? 

497,426 

5 .697 

D  •  73a 

Columbus                     540                      540,025                  5.732
DC Metro                   2,203                      746,?69                  5.873
Denver                       822                      512,691                  5.710
Ft. Worth                    453                      393,476                  5.595
Indianapolis                 48?                      746,613                  5.873
Milwaukee                    213                      709,537                  5.851
New Orleans                1,066                      593,471                  5.773
Portland                     708                      375,161                  5.574
St. Louis                  1,502                      622,236                  5.794
San Antonio                  550                      650,188                  5.813
San Diego                    286                      695,790                  5.842
Seattle                      603                      524,623                  5.720
Ann Arbor                    394                       99,797                  4.999
Billings                     153                       52,851                  4.723
Chattanooga                  579                      113,003                  5.053
Concord                       30                       30,022                  4.477
Dayton                     1,200                      239,591                  5.379
Everett                      229                       51,926                  4.715
Lafayette                    171                       44,955                  4.653
Orlando                      792                       97,565                  4.989
Pueblo                       271                       96,746                  4.986
Schenectady                  162                       77,134                  4.887
Syracuse                     352                      197,297                  5.295
Tacoma                       408                      151,061                  5.179
5.  The following data show the number of reported cases of mumps per
month for 1972 and 1973.

(a)  Scatterplot the data (remember, the X variable here is TIME).
What trends (if any) do you see?

(b)  Smooth the data with running medians of 3.

(c)  Discuss the periodicity of the time series.

(d)  How many cases of mumps do you think were reported in March 1974?
August 1974? December 1974? February 1975? September 1975?

(e)  What implications can you draw from your analysis regarding
when during a year spot commercials should be run on TV to
convince parents to get mumps vaccine shots for children?


Reported Cases of Mumps in the US

               1972       1973

January        9184       7160
February       8921       7349
March         10806       8306
April          9663       6434
May            9929       7404
June           5483       5045
July           2634       2030
August         1799       1357
September      1480       1068
October        2641       2456
November       5418       4759
December       6205       5751

Source: Center for Disease Control, Morbidity & Mortality, Vol. 23.
6.  The following data give median incomes of persons 25 years old or
older by years of school completed and by sex for the US in 1973.

(a)  Make a scatterplot of root median income (Y) against years of
school completed (X). What relationship (if any) do you observe?

(b)  Calculate the regression line for income vs. years in school
for MEN ONLY. Plot this line on the scatterplot you drew for
(a). Does this line confirm or contradict the relationship
you observe in (a)? Calculate and plot the residuals. What
do you conclude about this model from the plot of the residuals?

(c)  Calculate the regression line for income vs. years in school
for WOMEN ONLY. Plot this line on the scatterplot you drew
for (a). Does this line confirm or contradict the relationship
you observe in (a)? Calculate and plot the residuals. What
do you conclude about this model from the plot of the residuals?

(d)  Calculate the regression line for income vs. years in school
for BOTH MEN AND WOMEN (i.e., combine all the data into one
batch). Plot this line on the scatterplot you drew for (a).
Calculate and plot the residuals. How well does this line
fit the data? Do you prefer this single line or the two
individual lines found in (b) and (c), and why? How important
is the person's sex in fitting a model to this data?

(e)  What policy implications do you derive from your analysis
concerning the differential status of men and women in the US?


Median Annual Income of Persons 25 years old and over
by Years of School completed and by Sex for the USA 1973

Years of School         Median Income
Completed              Men        Women

     4               $ 4463      $ 1873
     8                 6371        2220
    10                 8622        2836
    12                10832        3970
    14                11670        4564
    16                13939        6214
    20                16027        8936

Source: U.S. Dept. of Commerce, Bureau of Census, Current
Population Reports, Series P-60, No. 97.
7.  The following data show the number of reported cases of Venereal
Disease (Gonorrhea and Syphilis) per year for 1962-1974.

(a)  Scatterplot the data (remember, the X variable here is TIME).
What trends (if any) do you see?

(b)  Interpolate to find the number of cases in 1970 of Syphilis
and Gonorrhea.

(c)  Plot Y(t) vs Y(t-1). What does this plot tell you?

(d)  How many cases of gonorrhea do you estimate will be reported
in 1976? How many of syphilis?

(e)  If you were designing new public health programs to reduce
the incidence of venereal disease, what comparative emphasis
would you make over the next ten years regarding syphilis and
gonorrhea? What aspects of your analysis would you use to
convince a group of concerned lay people of the correctness
of your policy?


Reported Cases of VD per year for 1962-1974

Year      # cases syphilis      # cases gonorrhea
             (in 1000s)            (in 1000s)

1962            126                   264
1963            124                   278
1964            114                   361
1965            113                   325
1966            105                   352
1967            102                   465
1968             96                   465
1969             92                   535
1971             96                   670
1972             91                   767
1973             87                   843
1974             84                   899

Source: Center for Disease Control, Morbidity & Mortality, vols. 20, 23 #53
1.  # Births expected by wives 18-39 years old by age.

(a)  Stem-and-leaf displays (Unit = .1) and 5 number summaries

     21 yrs. old        27 yrs. old        32 yrs. old        37 yrs. old

     2 | 244            2 | 4              2 | 89             3 | 22223
     2 | 66899          2 | 678            3 | 02             4 | 01122
     3 | 14             3 | 01234          3 | 5778
                        4 | 0              4 | 34

     M     2.7          M     3.05         M     3.6          M     3.65
     H   2.4  2.9       H   2.7  3.3       H   3.0  3.8       H   3.2  4.1
     E   2.2  3.4       E   2.4  4.0       E   2.8  4.4       E   3.2  4.2

     Apparent trend towards increase in expected births.

(b)  (See plot) Same trend is more apparent. (Rate of

(c)  (See plot) The plot with the connecting lines shows the same trends. It
is a matter of opinion which plot makes these trends most obvious; probably
the best case can be made for plot (c), with the connecting lines.

(d)  Expected births vs. race (Black > White trend).
     Expected births vs. year (Decreases with year).

(e)  As a woman's age increases (and also the number of children she
already has) the total number of children she expects to have increases.

Since we do not know from these numbers whether the number of children
of elementary school age is increasing (over time) we cannot use
these numbers for predicting the demand for elementary school teachers
in the future.

But note that expected family size is certainly less in 1972 than
in earlier years; hence, the number of elementary school age
children will probably decrease in the future, causing a decrease
in demand for teachers.


GRAPH 1(b): Number of births expected by wives 18-39 years old, by age.
(Vertical axis: number of births expected; horizontal axis: age of wife, 21, 27, 32, 37.)
2(a)  # Employees on non-agricultural payrolls, values ordered and cut (tens of thousands)

Year         1951   1953   1955   1957   1959   1961   1963   1965   1967   1969   1971   1973    State

extreme        31     31     32     33     35     37     41     45     49     53     54     61    (Ark)
               33     34     35     36     39     40     44     48     53     56     59     67    (Miss)
               50     54     53     54     56     58     63     68     75     81     86     98    (SC)
hinge
               66     69     70     75     76     77     81     88     95    100    102    113    (Ala)
               66     71     72     80     78     78     81     90    100    104    106    117    (La)
median         75     84     86     88     90     93    100    110    121    130    135    153    (Tenn)
               80     85     91     97    100    103    112    121    133    141    155    174    (Va)
               86     90     95     99    103    105    113    125    139    153    160    179    (Ga)
hinge
               87     92     96    110    116    120    129    143    160    174    181    201    (NC)
               98    102    105    115    127    133    144    161    181    206    224    275    (Fla)
extreme       210    222    229    245    251    254    270    292    325    359    369    414    (Tx)

5 number summaries

Year         1951   1953   1955   1957   1959   1961   1963   1965   1967   1969   1971   1973

extreme        31     31     32     33     35     37     41     45     49     53     54     61
hinge          58   61.5   61.5   64.5     66   67.5     72     78     85   90.5     94  105.5
median         75     84     86     88     90     93    100    110    121    130    135    153
hinge        86.5     91   95.5  104.5  109.5  112.5    121    134  149.5  163.5  170.5    190
extreme       210    222    229    245    251    254    270    292    325    359    369    414

midspread    28.5   29.5   34.0   40.0   43.5   45.0     49     56   64.5   73.0   76.5   84.5
3/2 H          43     45     51     60     65     67     74     84     96    109    115    127

Note that the relative ranking of states remains the same (possibly reflects total population) each year.
Also note that the spread increases.
GRAPH: # Employees on non-agricultural payrolls
(upper whiskers and adj. values not shown)
(Vertical axis: # (tens of thousands), 100 to 400; horizontal axis: year, 51 to 73.)
2(b)  Note that the # employees on non-agricultural payrolls increases
over time for all of the states (in fact the # employees on non-
agricultural payrolls roughly doubles between 1951 and 1973).

(c)  See also (b) above. Since the rise in # employees on non-agricultural
payrolls may be due "only" to the general rise in population, we
have insufficient data to draw conclusions about the changing
employment structure (if indeed it is changing) in the old south.
We might wish to examine possible changes by comparing the annual
rate of increase of # employees on non-agricultural payrolls with
either rate of increase of total population (by state) or rate of
increase of # employees on agricultural payrolls, or even both.
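
The 5 number summaries in 2(a) can be checked mechanically. The sketch below is only an illustration, not part of the original answer key: it assumes hinges are found at depth (floor(median depth) + 1) / 2, averaging two neighbouring values when the depth ends in .5, and the function name `five_number_summary` is made up for this example. It is applied to the cut 1951 batch.

```python
def five_number_summary(batch):
    """Return (low extreme, lower hinge, median, upper hinge, high extreme)."""
    values = sorted(batch)
    n = len(values)

    def value_at_depth(depth, from_top=False):
        # Depths count in from one end, starting at 1. A depth ending in .5
        # means "average the two neighbouring ordered values".
        ordered = values[::-1] if from_top else values
        whole = int(depth)
        if depth == whole:
            return ordered[whole - 1]
        return (ordered[whole - 1] + ordered[whole]) / 2

    median_depth = (n + 1) / 2
    hinge_depth = (int(median_depth) + 1) / 2
    return (
        values[0],
        value_at_depth(hinge_depth),
        value_at_depth(median_depth),
        value_at_depth(hinge_depth, from_top=True),
        values[-1],
    )


# 1951 batch, cut to tens of thousands (11 states).
batch_1951 = [31, 33, 50, 66, 66, 75, 80, 86, 87, 98, 210]
low, lower_hinge, median, upper_hinge, high = five_number_summary(batch_1951)
midspread = upper_hinge - lower_hinge
print(low, lower_hinge, median, upper_hinge, high, midspread)
```

With 11 values the median sits at depth 6 and the hinges at depth 3.5, so each hinge is the average of the 3rd and 4th values counted in from its end.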
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-52.5 

-2 

43 

-i 

51521 

-0** 

6649ef534l94 

1 

0122260134081403477 
0402 

0 

2 

Hi 

60;O 

5 Number Summary of Residuals

         depth      value

   E       1        -52.5
   H      12         -6.0
   M      23          0
   H      12          4.?
   E       1         60.0
Minibatches, ordered & summarized

Batch                        1          2          3          4          5
# pupils (thousands)       0-1.9     2.0-2.9    3.0-3.9    4.0-5.9    6.0-up
Median # pupils             1.0        2.5        3.5        5.0        8.1
(thousands)

budgets                    16.5       36.8       45.2       70.9       89.9
(ordered)                  25.0       39.3       49.3       73.6      118.2
                           25.1       47.9       51.2       74.3      118.7
                           27.3       48.2       54.7       81.4      133.0
                           31.8       48.5       63.1       84.6      137.9
                           32.8       48.6       64.9       87.8      142.4
                           33.8       49.8       69.1       90.9      146.3
                           34.0       51.8                  97.0      147.1
                           34.3       52.8                  98.6      149.9
                                                           106.5      150.0
                                                                      202.4

extreme                    16.5       36.8       45.2       70.9       89.9
hinge                      25.1       47.9       50.25      74.3      125.7
median                     31.8       48.5       54.7       86.2      142.4
hinge                      33.8       49.8       64.0       97.0      148.5
extreme                    34.3       52.8       69.1      106.5      202.4

midspread                   8.85       7.2       13.75      22.7       22.8


Residuals from Medians:

                          -15.3      -11.7       -9.5      -15.3      -52.5
                           -6.8       -9.2       -5.4      -12.6      -24.2
                           -6.7       -0.6       -3.5      -11.9      -23.7
                           -4.5       -0.3        0         -4.1       -9.4
                            0          0         +8.4       -1.6       -4.5
                           +1.0       +0.1      +10.2       +1.6        0
                           +2.0       +1.3      +14.4       +4.7       +3.9
                           +2.2       +3.3                 +10.8       +4.7
                           +2.5       +4.3                 +12.4       +7.5
                                                           +20.3       +7.6
                                                                      +60.0
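
The residuals above are each budget minus the median (the conditional typical value) of its own mini-batch. A minimal sketch of that step for batch 3 follows; the helper name `residuals_from_median` is hypothetical, and rounding to one decimal is an assumption made to match the precision of the table.

```python
from statistics import median


def residuals_from_median(batch):
    """Subtract the batch median from every value in the batch."""
    typical = median(batch)
    # Round to one decimal so floating point noise does not obscure the result.
    return typical, [round(value - typical, 1) for value in batch]


# Batch 3 (3.0-3.9 thousand pupils), budgets as listed in the table above.
batch_3_budgets = [45.2, 49.3, 51.2, 54.7, 63.1, 64.9, 69.1]
typical_3, residuals_3 = residuals_from_median(batch_3_budgets)
print(typical_3, residuals_3)
```

Pooling the residual lists from all five mini-batches gives the single batch analyzed for lack of fit in part (d).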
4.  (a)  The crime rate index increases with center city population.
There is, however, quite a bit of variation, even within cities
of similar size.

(b)  The data have a shape which suggests that we should transform
X down. Log (X) is not unreasonable.

(c)  The crime rate index increases with log center city population
in a very linear fashion.

(d)  The unpolished resistant line is

Crime = -4381.6211 + (945.2319) (Log Population)

(e)  A very noticeable trend in the residuals is that the absolute
value of the residual increases with the size of the log center
city population. (Note the increasing divergence of the data
from the fitted line as log(pop) increases. Otherwise the fit
is "pretty good.")

(f)  The once-polished resistant line is

Crime = -3629.6418 + 805.4307 (Log Population)

There is a slight difference between the unpolished and
polished lines, but the fit appears to be about the same.

(g)  Same comments as (e). The two residual plots appear to be
very similar.

(h)  The violent crime index rate on transit systems increases with
population. This relationship can be described approximately
by the fitted model (resistant line)

Crime = -3629 + 805 (Log(pop)).

The model fits best for areas with smaller populations.
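
As a hedged illustration of how the fitted model in (f) is used, the sketch below evaluates the once-polished line for one city and forms its residual (observed minus fitted). The coefficients are the ones quoted above and the Denver figures come from the data table; the function names are made up for this example, and base-10 logs are assumed because that is what the table's log column appears to contain.

```python
import math

# Coefficients of the once-polished resistant line quoted in (f).
INTERCEPT = -3629.6418
SLOPE = 805.4307


def fitted_crime_rate(population):
    """Fitted violent crime index rate for a given center city population."""
    return INTERCEPT + SLOPE * math.log10(population)


def residual(observed_rate, population):
    """Observed rate minus fitted rate."""
    return observed_rate - fitted_crime_rate(population)


# Denver: rate 822, center city population 512,691.
denver_residual = residual(822, 512_691)
print(round(denver_residual, 1))
```

Repeating this for every city and plotting the residuals against log population is the check described in (e) and (g).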
5.  (a)  There are fewer cases of mumps in 1973 than in 1972, and there
are monthly or seasonal effects.

(b)  Reported Cases of Mumps.

Month        Data      Smoothed Once

J 72        9,184         9,184
F           8,921         9,184
M          10,806         9,663
A           9,663         9,929
M           9,929         9,663
J           5,483         5,483
J           2,634         2,634
A           1,799         1,799
S           1,480         1,799
O           2,641         2,641
N           5,418         5,418
D           6,205         6,205
J 73        7,160         7,160
F           7,349         7,349
M           8,306         7,349
A           6,434         7,404
M           7,404         6,434
J           5,045         5,045
J           2,039         2,039
A           1,357         1,357
S           1,068         1,357
O           2,456         2,456
N           4,759         4,759
D           5,751         5,751

(c)  Lows occur in summer months (July, August, September, (October)).

Highs occur in winter/spring months (December, January, February,
(March), (April)).

Note the decreasing overall trend.

(d)  March 74 ~ 7000          Feb. 75 ~ 5000(?)
     Aug. 74 ~ 1100           Sept. 75 ~ 1000(?)
     Dec. 74 ~ 5000

(e)  One should run commercials a month or two prior to the expected
onset of mumps (to allow time for parents to get their children
vaccinated and to allow time for the children to develop
immunity from the disease).

The data suggest midsummer (July) for these spots to begin, with
an exhortation to get children vaccinated (before school starts
in September).
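
The "Smoothed Once" column in (b) can be reproduced with a few lines of code. This is a sketch rather than the course's own procedure: it assumes the two end values are copied unchanged, which is what the table shows, and the function name `running_median_3` is invented for the example. The July 1973 value is taken as 2,039, as listed in the solution table.

```python
from statistics import median


def running_median_3(series):
    """Smooth by running medians of 3, copying the two end values unchanged."""
    if len(series) < 3:
        return list(series)
    middle = [median(series[i - 1:i + 2]) for i in range(1, len(series) - 1)]
    return [series[0]] + middle + [series[-1]]


# Monthly reported cases, January 1972 through December 1973.
mumps = [
    9184, 8921, 10806, 9663, 9929, 5483, 2634, 1799, 1480, 2641, 5418, 6205,
    7160, 7349, 8306, 6434, 7404, 5045, 2039, 1357, 1068, 2456, 4759, 5751,
]
smoothed_once = running_median_3(mumps)
print(smoothed_once)
```

Each smoothed value is the middle one of a month and its two neighbours, so single-month spikes such as March 1972 are pulled back while the seasonal swing is kept.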
6.  a)  The scatterplot suggests a trend of increasing median income with
increasing education. If we code the points for men and women,
we can readily see that men's incomes are greater than women's
incomes at each level.

b)  Root (Median Income (men)) = 52.20 + 3.94 (Years of School Completed)

The line confirms the increasing trend in men's median income
with years of school completed. It also accents the apparent
differences in men's and women's salaries for comparable levels
of education.

Residuals:     -1.16
               -3.90
                1.25
                4.60
                 .67
                2.82
               -4.40

There is no noticeable pattern in the plot of the residuals. The
least squares line seems to fit quite well.

c)  Root (Median Income (women)) = 23.55 + 3.37 (Years of School Completed)

The line confirms the increasing trend in women's median income
with years of school completed. The slope, being less than the
slope of the line for men, indicates that the average additional
income from one more year of education is less than that for males.

Residuals:      6.25
               -3.39
               -4.00
                -.98
               -3.17
                1.36
                3.58

There is no noticeable pattern in the plot of the residuals. The
least squares line seems to fit well but the residuals are larger
than in the case of men's incomes.

d)  Root (Median Income) = 37.87 + 3.65 (Years of School Completed)

Residuals:      Men         Women
               14.82        -9.2
               12.73       -19.97
               18.45       -21.15
               22.38       -18.69
               19.02       -21.45
               21.75       -17.48
               15.68       -16.39

The residuals are strongly positive for those points corresponding
to men's incomes and strongly negative for those points
corresponding to women's incomes. The two lines clearly provide
a superior summary of the data. However, if we are asked
to predict average income, regardless of sex, we would want to
use the regression line for both sexes.

If we are given the sex of an individual we can predict median
income with much greater accuracy than if we are forced to use
the equation derived from the combined data (barring blind luck).
Compare the sums of the absolute values of the residuals from the
two regression lines as opposed to the one: 41.53 versus 238.16.

e)  There are several possible reasons for the disparity in median
incomes between men and women.

-There may be job discrimination against women (e.g. lower
level of job assignments despite equal training, lower pay
for the same job).
-Women may tend to concentrate their studies and take jobs
in fields which typically pay less than those fields which
interest men.
-More women than men of comparable education may decide to
not market their skills (e.g. become home-makers).

These three explanations, as well as others, all probably contribute
to the disparity. Further study is needed to confirm
or deny each. The job discrimination possibility is of particular
concern since this is an illegal practice and could be dealt
with in the courts in specific cases. The policy implications
of the latter two explanations are less clear since they may or
may not involve a sex-related choice. It may be that if women
tend to stay out of particular fields of work or study, some type
of effort such as a publicity campaign directed towards women
should be undertaken to bring their talents into the field (e.g.
the military has recently been trying to attract female volunteers).
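
The two single-sex lines in b) and c) can be reproduced with an ordinary least squares fit of root median income on years of school. The sketch below is illustrative only. It assumes the seven schooling levels are 4, 8, 10, 12, 14, 16 and 20 years (the 4 and 14 are inferred from the printed residuals, since those cells are blank in the scanned table), and `least_squares_line` is a name made up for this example.

```python
import math


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


years = [4, 8, 10, 12, 14, 16, 20]          # 4 and 14 are assumed values
men = [4463, 6371, 8622, 10832, 11670, 13939, 16027]
women = [1873, 2220, 2836, 3970, 4564, 6214, 8936]

men_intercept, men_slope = least_squares_line(years, [math.sqrt(v) for v in men])
women_intercept, women_slope = least_squares_line(years, [math.sqrt(v) for v in women])
print(round(men_intercept, 2), round(men_slope, 2))
print(round(women_intercept, 2), round(women_slope, 2))
```

Under these assumptions the fitted coefficients agree with the printed equations to two decimals, which is some support for reading the blank schooling levels as 4 and 14.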
7.  (a)  Gonorrhea has been increasing over time, while syphilis has been
declining.

(b)  For 1970:   (92 + 96) / 2 = 94 thousand (Syphilis)

                 (535 + 670) / 2 = 602.5 thousand (Gonorrhea)

(c)  The linear relationship implies that a good prediction for the
number of cases in a given year can be based on the number of
cases during the previous year.

(d)  Syphilis: 78-80 thousand (decrease of 2-3 per year)

Gonorrhea: 980 thousand (# increases by about 3/4 of previous
year's increase)

Hence:     1972-1971     97
           1973-1972     76
           1974-1973     56
           1975-1974     3/4 (56) ~ 45
           1976-1975     3/4 (45) ~ 36

           45 + 36 = 81

           (899 + 81) = 980

(e)  Clearly, Gonorrhea is far more prevalent. Further, while the
incidence of Gonorrhea is increasing ("out of control"), that
of Syphilis appears to be on the decline ("under control").
Note, however, that the data appear to indicate that the rate
of increase of Gonorrhea is leveling off.

Since the spread of VD is crucially dependent on the number of
individuals infected, clearly Gonorrhea will require a greater
amount of effort to bring under control.
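
The interpolation in (b) and the lagged pairs plotted in (c) are simple enough to sketch in code. This is an illustration under stated assumptions, not the key's own working: straight-line interpolation between 1969 and 1971 is assumed, and the helper names `linear_interpolate` and `lag_pairs` are hypothetical.

```python
def linear_interpolate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def lag_pairs(series):
    """Pairs (y(t-1), y(t)) for plotting y(t) against y(t-1)."""
    return list(zip(series[:-1], series[1:]))


# Reported cases in thousands for 1969 and 1971; 1970 is missing.
syphilis_1970 = linear_interpolate(1969, 92, 1971, 96, 1970)
gonorrhea_1970 = linear_interpolate(1969, 535, 1971, 670, 1970)
print(syphilis_1970, gonorrhea_1970)

# Syphilis series 1962-1974 with the interpolated 1970 value inserted.
syphilis = [126, 124, 114, 113, 105, 102, 96, 92, syphilis_1970, 96, 91, 87, 84]
print(lag_pairs(syphilis)[:3])
```

Because 1970 lies midway between the two known years, the interpolated value is just the average of its neighbours, as in the answer to (b).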


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y(t)  vs  y(t-l)  for  Syphilis  (Includes  year  1970) 
Quiz Unit 3


Time = 60 minutes

Suggested problem times given. Credit is roughly proportional to
these times.


Part I.  Answer 5 of the 6 following questions (15 minutes)


(1)  Give two possible reasons why the Y observations in an (X,Y) data
set should be transformed.

(2)  How does an ordered multiple batch differ from an unordered multiple
batch?

(3)  What representative point is used as a "conditional typical value"
in a mini-batch?

(4)  How do "outlying" data values affect a least squares line and a
resistant line?

(5)  What are "residuals" from a line fitted to an (X,Y) data set?

(6)  According to the study cited by Tufte, what factors affect voting
rates in American cities?
Part II.  Answer 2 of the following 3 questions. (20 minutes)

(1)  Which direction do we move on the ladder of powers and for what
variables, if the scatterplot of the raw data resembles:

[Sketches not reproduced.]

(2)  What do these residuals from a polished fit imply about how well the
linear model fits the data (x-variable = X, y-variable = residuals)?
(3)  [Graph not reproduced.]

This graph is from Louis Battan's article, "Cloud Seeding and
Rainmaking" (Statistics: A Guide to the Unknown, pp. 354-361).

(a)  What problem does Battan discuss? What solution does he propose?

(b)  Interpret the graph. (A complete answer will include a description
of the plotted points, a discussion of goodness of fit, and the
implications for Battan's hypothesis.)
Part III.  Answer 1 of the following 2 questions (25 minutes)

(1)  Consider the following data set on U.S. Coal Production (from
J. Tukey, Exploratory Data Analysis, Chapter 7).

year      coal production

1955      467 million tons
1956      500
1957      493
1958      410
1959      412
1960      416
1961      403
1962      422
1963      459
1964      467
1965      512
1967      552
1968      545

Note that the production for 1966 is not available.

(a)  Smooth these data until the ith smooth is identical to the (i-1)th
smooth. Merely use the actual data values for end points.

(b)  Interpolate to find coal production in 1966.

(c)  Extrapolate to find coal production in 1969.

(d)  Are there any apparent trends in these data?
(2)  Consider the following data set on 1976 and pre-1976 malpractice
insurance premiums for Greater Boston hospitals and health centers
(from D. Hoaglin, A First Course in Data Analysis, chapter 5).

                                       (thousands of $)
Institution                        old premium     new premium

Boston Hospital for Women               418
Peter Bent Brigham                     1226
Robert Breck Brigham                    169            19?
Children's                              866            646
Beth Israel                             833            635
Massachusetts General                  2263           1682
McLean                                  255            218
Harvard Health Service                  136            114
Mt. Auburn                              162            148
Sidney Farber Cancer Center              66             60
Massachusetts Eye and Ear               350            258
Harvard Community Health Plan           192            136
New England Deaconess                   435            348

Let x = old premium, and y = new premium. We have calculated

Σx_i     = 7365            x̄ = 566.5
Σx_i²    = 8719409         ȳ = 423.7
Σy_i     = 5508
Σy_i²    = 4752590
Σx_i y_i = 6432511


(a)  Find least squares estimates of the coefficients for the linear model
relating y to x.

(b)  Would a resistant line be very different from the least squares line
for these data? Why or why not?
Quiz Unit 3
Solutions

I.   1.  Y observations in an (X,Y) dataset may be transformed to
equalize variance or to promote linearity.

2.  The batches in an ordered multiple batch are related in a
quantitative way, while those in an unordered batch are
related only in a qualitative way.

3.  We use the median of a mini-batch as the "conditional typical
value".

4.  An outlying value pulls a least squares line towards it but
has very little effect on a resistant line.

5.  A residual from a fitted line is the observed Y value minus
the fitted Y value.

6.  The factors which affect voting rates are:

(a)  closeness of election

(b)  ease of voting

(c)  socioeconomic factors


II.  1.  (a)  Up on X, down on Y

(b)  Down on X and Y

(c)  Down on X, up on Y

(d)  Up on X and Y

(e)  transformation necessary

2.  (a)  Good fit

(b)  There is still level to be removed

(c)  There is still tilt to be removed

(d)  The data needed to be transformed

(e)  Slope and level have been removed, but spread of the
residuals increases as X increases. A transformation of
Y would be appropriate.


3.  (a)  Battan discusses potential drought and proposes cloud
seeding as a way to increase rainfall.

(b)  The plot is of the square root of rainfall for target
area (Y) and control area (X). Each point represents a
month; the "x" point represents the month during which
cloud seeding took place in the target area. The fitted
line describes the data well, implying that the target
and control areas are well matched. Since the seeded
month is not an outlier, Battan's experiment provides
no proof that cloud seeding does increase rainfall.
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Final  Snooth 
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422 
459 
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HA 
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467 

493 

493 

412 

412 

412 

416 

422 

459 

467 
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532 
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467 

493 

493 

412 

412 

412 

416 

422 

459 

467 

489.5 

532 

545 

545 


Interpolated value is 532 million tons.
There are several extrapolated values for 1969 which
make sense.  You may argue that coal production is leveling
off at 545 million tons.  You may argue that coal production
is gradually increasing and will be approximately
551 million by 1969.

Coal production is increasing over time.  There may be
periodicities, but we don't have enough data to tell.

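(a)  $$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
= \frac{\sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y}}{\sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2}
= \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = .7284$$

$$a = \bar{y} - b\bar{x} = 423.7 - .7284(566.5) = 11.06$$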

(b)    No, a resistant line would not be very different.  A
plot of the data shows that there is a linear relationship
between old and new premiums, with no outliers.
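
The arithmetic in part (a) can be checked with a short Python sketch. The sums are the ones given in the problem; the sample size n = 13 is not stated here and is an assumption inferred from 7365 / 566.5. Because the sketch uses unrounded means, its intercept may differ slightly from the hand calculation, which uses the rounded values 423.7, 566.5, and .7284.

```python
# Summary sums given in the problem (x = old premium, y = new premium).
n = 13            # assumption: inferred from sum_x / x_bar = 7365 / 566.5
sum_x = 7365
sum_y = 5508
sum_xx = 8719409  # sum of x squared
sum_xy = 6432511  # sum of cross-products x*y

x_bar = sum_x / n
y_bar = sum_y / n

# Slope: corrected cross-product sum over corrected sum of squares.
b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)

# Intercept: the least squares line passes through (x_bar, y_bar).
a = y_bar - b * x_bar

print(round(b, 4), round(a, 2))
```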
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Unit  4  

Reading Assignments


Lecture     Reading

4-0         Prerequisite Inventory

4-1         Tufte, pp. 135-148

4-2         Wonnacott & Wonnacott, pp. 1-24, 53-67

4-3         Tufte, pp. 156-163

4-4         No reading

Workshop    Handout:  "What to Look for in Reading Technical
            Reports"

4-5         Handout:  "Covariances and Independence in the
            Bivariate Multiple Regression Model"

4-6         Tufte, pp. 148-155

4-7         No reading


In addition, read the following articles:

Kaplan, Robert, and Samuel Leinhardt, "Determinants of Office
Location," Medical Care, Vol. 11, No. 5, Sept.-Oct. 1973, pp. 406-415.
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Prerequtstte  Inventory,  Unit  4 

Unit 4 of Module II is concerned with multiple linear regression,
i.e., the fitting of linear models relating many X variables to a single
Y variable.  As in the previous three units, the ability to master the
concepts and techniques in Unit 4 is dependent upon the mastery of
several simple mathematical ideas.  Before proceeding to Unit 4, you
should be very familiar with the topics discussed in this inventory.

This inventory is divided into the following sections:

1.  Review of Units 1 and 2, Batches of Data.

2.  Review of Unit 3, Univariate Regression.

3.  Representation of a data set as a matrix.

4.  Matrix manipulations.

This unit depends heavily on Unit 3; if you feel that you do not have
a good understanding of this prior unit, please consult a member of the
course's teaching staff for additional tutoring.

Section 1.  Review of Units 1 and 2, The Analysis of Batches of Data

A good review of the concepts and techniques of the first module
of QMPM is given in Section 1 of Prerequisite Inventory, Unit 3.  Detail
concerning the construction of number summaries, schematic plots, and
stem-and-leaf displays is presented there, as well as a review of important
terminology.  You should reread this section since this material is
important for the procedures and concepts of Unit 4.

Number summaries, schematic plots, and stem-and-leaf displays
may be drawn in parallel when analyzing multiple batches of data.  We
merely use one scale (or one set of stems) for all the batches.
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When  are  these  concepts  einpibyed  in  the  analysis  of  ccroplex  inuiti- 
variable  data  sets?    Since  such  data  sets  are  a  collection  of  batches, 
related  in  some  complicated  fashion,  the  tools  of  Unit  1  and  2  are 
useful  in  "getting  a  feel  for  the  data".     One  should  analyze  the 
multi-variable data set as single batches, or a multiple batch, before
fitting the desired linear model.

These tools are also very helpful in evaluating how well the model
fits the data.  Since data = fit + residual, the single batch of
residuals from the fit is extremely important.  Residuals as a batch
are occasionally assumed to be well-behaved, a powerful assumption, not
often justified.  We postpone the discussion of residuals and well-behaved
batches to the next section.
Section  2.    Review  of  Unit  3,  Univariate  Regression 

Unit 3 discussed the analysis of ordered multiple batches, a
collection of batches with an associated scale.  For example, a data
set of the number of live births of women, classified by the age of the
mother at time of birth, is an ordered multiple batch.  We have one
batch for women under 15 years of age, one batch for women 15-19 years,
one for 20-24 years, 25-29 years, 30-34 years, 35-39 years, 40-44
years, 45 and over.  There are 8 batches, each with values for total
number of live births, one datum for each year from 1950 to 1967.
Associated with each batch is the midpoint of the age interval.
These midpoints, 14, 17, 22, 27, 32, 37, 42, and 46, constitute the
age scale for the multiple batch.

We showed how parallel schematic plots are drawn for ordered
multiple batches.  Each plot is centered at the correct value on the
scale for the batch, and the width of the plot is made equal to the
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width  of  the  interval  on  the  scale  associated  with  the  batch.  As 
can  be  seen  from  the  live  birth  schematic  plot,  exhibit  1,  this  display 
summarizes  the  relationship  between  the  data  and  the  scale  values 
quite  well. 

To summarize further these ordered batches, we compute typical
values for the values, conditional on the values being located in a
specific batch:  conditional typical values.  The conditional typical
value for a data value in batch i is defined to be the median of the
batch.  The conditional typicals for the live birth data are given
in Exhibit 2.  Note how the values rise and fall as age increases,
similar to the raw data.  The conditional typicals are representative


Exhibit 2

Live Births by Age of Mother
Conditional Typical Values, or "Fits", for Each Age Class

Age of Mother (X)     Typical Value of "Y", Given "X"

Under 15                      6,700 births
15-19                       560,000 births
20-24                     1,310,000 births
25-29                     1,065,000 births
30-34                       680,000 births
35-39                       330,000 births
40-44                        85,000 births
Over 45                       5,000 births
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values  for  each  batch,  and  reflect  the  relationship  between  the  raw 
data and the values along the ordered scale.  Thus, we have the
decomposition of data values into conditional typicals + residuals.
If the conditional typicals provide a good fit to the data, residuals
will be small; otherwise, they will be large.  Exhibit 3 displays
the residuals from the conditional typical values for the live birth
data.  The many far out points indicate some lack of fit.
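
The decomposition into conditional typicals plus residuals can be sketched in a few lines of Python. The batch values below are hypothetical and chosen only for illustration (they are not the live birth data), and the function names are not part of the course material.

```python
from statistics import median

# Hypothetical ordered multiple batch: scale value -> observations in that batch.
batches = {
    17: [540, 560, 575, 590],
    22: [1250, 1310, 1330, 1400],
    27: [1010, 1065, 1080, 1120],
}

def conditional_typicals(batches):
    """Conditional typical value of each batch: its median."""
    return {x: median(ys) for x, ys in batches.items()}

def residuals(batches, typicals):
    """data = fit + residual, so residual = data - conditional typical."""
    return {x: [y - typicals[x] for y in ys] for x, ys in batches.items()}

typ = conditional_typicals(batches)
res = residuals(batches, typ)
print(typ)
print(res)
```

Large residuals in any batch would point to lack of fit, as described above.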

The  analysis  of  (X,Y)  paired  observational  data  begins  with  a 
consideration  of  these  ordered  pairs  as  a  collection  of  mini-batches. 
We  use  the  X  variable  to  break  up  the  X  axis  into  several  intervals, 
and  then  group  together  the  Y  values  of  the  ordered  pairs  falling  in 
each  interval  into  a  single  mini-batch.    This  "chopping  up"  of  the 
X axis follows an examination of the scatterplot of the (X,Y) data.
Exhibit 4 is such a scatterplot of percent of the population illiterate
in 1930 in a state (X) and percent of the population illiterate in
1960 (Y).  There are 51 points, one per state and the District of
Columbia.

The scatterplot is used to break the data into mini-batches, such
that the intervals on the X axis are bounded by integers, are approximately
of equal width, and contain equal numbers of Y values.  It may not
be possible to achieve all three of these goals, but we must rely on
our professional judgment when working with real data.

Once we have achieved the reorganization of an (X,Y) data set into
batches, it can be analyzed as a collection of ordered batches.  The
important question is:  How linear is the relationship between the
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conditional  typical  values  and  the  X  scale?    To  answer  this  we  plot  the 
conditional  typicals  and  the  hinges  of  the  mini-batches  on  a  separate 
plot and connect them.  If the relationship is close to linear and if
the three lines are roughly parallel, as are the lines in Exhibit 5
of the illiteracy data, then we are in good shape.  If the plot lacks
these qualities, then we might want to transform our (X,Y) data set.

With regard to transformation we have 2 goals:  (1) increase
linearity, and (2) equalize spread.  The plot of the conditional
typicals and hinges is quite useful in assessing how far we must go to
achieve these goals.  If the lines connecting the medians, upper
hinges, and lower hinges are not straight, then a transformation on
the X variable to increase linearity is needed.  If the lines connecting
the summary quantities are not parallel, and diverge or
converge as X increases, then the midspreads of the batches are not
constant.  To equalize these spreads we transform Y.  How do we
determine how far up or down the ladder of powers to move with X and
Y?  Exhibit 6 is useful in this determination.  Identify the shape
of the scatterplot as one of the 4 functional forms in this display,
and transform accordingly.  Finding the best transformation is an
iterative process.  Try several.

Once  we  have  successfully  transformed  the  (X,Y)  data  set,  we  are 
now  ready  to  summarize  formally  the  relationship  of  Y  to  X.    We  fit  a 
line, either resistantly or by least squares, to the (X,Y) data set.
We  hypothesize 


Yi  =  a  +  bXi 



as the function relating Y to X.  We call the model fitting process
"regression", and state that "Y is regressed on X".  Since we have
only one variable X to be regressed upon, the regression is a "univariable"
or univariate regression.  Y is called the dependent variable,
X the independent variable.

Resistant lines are calculated by breaking the data into thirds,
or 3 equal sized mini-batches, and computing the median of the X's
and the median of the Y's (the conditional typical value) within each
third.  Label these three summary points

$(X_{(1)}, Y_{(1)})$  =  medians of first third
$(X_{(2)}, Y_{(2)})$  =  medians of second third
$(X_{(3)}, Y_{(3)})$  =  medians of third third.

We compute

$$b = \frac{Y_{(3)} - Y_{(1)}}{X_{(3)} - X_{(1)}}$$

and 

The resistant line may need a step of polish to remove all the
tilt and level from the residuals.  To polish, we fit a line to the
residuals from the previous fit and add the a and b calculated from
the polishing to the a and b from the previous fit.
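
A minimal Python sketch of the three-group resistant line and one polishing step follows. The slope uses the outer summary points as described above. The level formula is not legible in this copy, so the sketch assumes one common convention: averaging the intercepts implied by the three summary points. The function names are illustrative, not from the course material.

```python
from statistics import median

def resistant_line(xs, ys):
    """Three-group resistant line; returns (a, b) for y = a + b*x."""
    pairs = sorted(zip(xs, ys))            # order the points by x
    n = len(pairs)
    k = n // 3                             # size of the two outer thirds
    thirds = [pairs[:k], pairs[k:n - k], pairs[n - k:]]
    # Summary point of each third: (median of x's, median of y's).
    mx = [median(p[0] for p in t) for t in thirds]
    my = [median(p[1] for p in t) for t in thirds]
    b = (my[2] - my[0]) / (mx[2] - mx[0])  # slope from the outer thirds
    # ASSUMPTION: level = average intercept over the three summary points.
    a = sum(y - b * x for x, y in zip(mx, my)) / 3
    return a, b

def polish(xs, ys, a, b):
    """One polishing step: fit the residuals, then add the corrections."""
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    da, db = resistant_line(xs, res)
    return a + da, b + db

# Hypothetical data on the line y = 1 + 2x with one wild value at x = 9.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 5, 7, 9, 11, 13, 15, 17, 100]
a, b = resistant_line(xs, ys)
a2, b2 = polish(xs, ys, a, b)
print(a, b, a2, b2)
```

In this example the outlying value does not move the fitted line, which illustrates the sense in which the line is "resistant".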

Least squares minimizes the sum of the squared residuals.  We
seek the a and b that minimize

$$\sum_{i=1}^{n} (Y_i - a - bX_i)^2$$

Least squares provides a fit very similar to a resistant line if the
data  are  linear,  the  spread  about  the  line  is  constant,  and  there  are 
no  outliers.    If  any  of  these  conditions  are  violated,  then  the  least 
squares  line  will  not  fit  the  data  well,  and  the  resistant  line  is 
preferable.    Resistant  lines  are  "resistant"  to  violations  of  these 
assumptions. 

We compute

$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

and

$$a = \bar{Y} - b\bar{X}$$

as least squares coefficient estimates.  To evaluate how well the
least squares line fits the data, we calculate

$$s^2_{y|x} = \frac{\sum (Y_i - a - bX_i)^2}{n - 2}$$

and

$$R^2 = 1 - \frac{\sum (Y_i - a - bX_i)^2}{\sum (Y_i - \bar{Y})^2}$$

$s^2_{y|x}$ is the variance about the line and should be as small as possible.
$R^2$ is 1 minus the ratio of residual variation to total variation, and
is interpreted as the "percent of the total variation" explained by
the line.  The closer this quantity is to 1 the more completely the
line "explains" the data.
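
These quantities can be computed directly from their definitions. The following Python sketch uses four hypothetical (X, Y) pairs; the function and variable names are illustrative only.

```python
def least_squares(xs, ys):
    """Return (a, b, s2, r2) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # level
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))  # residual variation
    tss = sum((y - y_bar) ** 2 for y in ys)                  # total variation
    s2 = rss / (n - 2)                 # variance about the line
    r2 = 1 - rss / tss                 # share of total variation explained
    return a, b, s2, r2

# Hypothetical data.
a, b, s2, r2 = least_squares([1, 2, 3, 4], [2, 4, 5, 8])
print(a, b, s2, r2)
```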
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Residuals  are  defined  as 

$$\text{residual}_i = Y_i - a - bX_i$$

and are very important in evaluating the least squares and resistant
fit.  Residuals, treated as a single batch, should be well-behaved.  A
well-behaved batch is symmetric about the mean of the batch; approximately
68% of the batch values are within one standard deviation of the
mean, and approximately 95% of the batch values are within 2 standard
deviations of the mean.  Such a batch has no outliers.  This well-behaved
assumption is crucial to least squares lines and will be
discussed further in Unit 4.

A plot of the residuals versus X is also important.  Such a plot
should  be  a  random  swarm  of  points,  devoid  of  any  pattern.    Any  pattern, 
such  as  trigonometric,  wedge,  linear,  or  curvilinear,  is  an  indication 
that  the  line  does  not  fit. 

Time series data are a special kind of (X,Y) data.  The X variable
refers to time (months, weeks, days, etc.) and there is one $Y_i$ associated
with each $X_i$.  Time series data contain quite a bit of noise, and
it is usually necessary to smooth these data sets to filter out the
irregularities.  Running medians of 3 is one smoother, and involves
taking the median of 3 consecutive data values, beginning at the first
time point and working down to the last.  The data are smoothed
several times until the smoothed values from the ith iteration are
identical to those from the (i-1)th.  Exhibit 7 is a time plot of
emergency registrations at       C. General Hospital, and Exhibit 8 is
the smoothed time plot.  Note how many of the peaks and troughs have
been removed by the smoothing.
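
The repeated running-medians-of-3 smoother can be sketched in Python as follows. The text does not say how the two end points are treated; this sketch assumes they are copied unchanged. The series is hypothetical.

```python
from statistics import median

def running_median3(values):
    """One pass of running medians of 3.
    ASSUMPTION: the first and last values are copied unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = median(values[i - 1:i + 2])
    return out

def smooth(values):
    """Repeat the pass until an iteration leaves the values unchanged."""
    current = list(values)
    while True:
        nxt = running_median3(current)
        if nxt == current:
            return current
        current = nxt

series = [4, 9, 5, 6, 20, 7, 8]   # hypothetical time series
print(smooth(series))
```

The isolated peaks at 9 and 20 are removed by the first pass, and the second pass confirms that nothing further changes.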
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If  the  tinte  plot  ehbvs  sufficient  trendis,  then  we  may  extrapolate, 
estimate beyond the range of the data, and interpolate, estimate between
two consecutive time points.  The identification of periodicities
such as seasonal highs and lows is also important.

Section 3.  Representation of a data set as a matrix.

Consider an (X,Y) data set.  This data set contains 2 related
batches X and Y of equal size N.  The observations in X are denoted $x_i$,
and those in Y, $y_i$.  In fact,

$$X = (x_1, x_2, \ldots, x_{N-1}, x_N)$$
$$Y = (y_1, y_2, \ldots, y_{N-1}, y_N)$$

i.e., the data vector X, of length N, can be represented as an N-tuple
of values $x_1$ through $x_N$.  Similarly for Y.  We have written X and Y
horizontally; henceforth, we shall represent these vectors as vertical
columns:

$$\underset{\sim}{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}, \qquad
\underset{\sim}{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

We call X and Y column vectors and represent them with little letters
underscored with "tildes":  x and y.  All vectors will be written as
little letters with tildes:  a, b, etc.  The length of a vector is
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equal  to  the  number  of  observations »  N.  y»  In  the  linear  model ,  Is 
called  the  dependent  variable^  or  vector  of  dependent  observations. 

Unit 4 is concerned with data sets containing a data vector y and
more than one x vector.  We relate the variable Y to variables
$X_1, X_2, \ldots, X_p$, i.e., we seek to "describe" Y as a function of p
independent variables $X_1, \ldots, X_p$.  We need a convenient mathematical
representation of the variables $X_1, X_2, \ldots, X_p$.

We present an example.  For N = 10 eastern states, we have data
on the transportation equipment industry in 1957.


                 Aggregate       Aggregate Capital    Aggregate Man-Hours
                 Value Added     Service Flow         Worked
State            (million $)     (million $)          (million man-hours)

Connecticut          690              39                  124
Maine                 29               2                    6
Maryland             415              18                   69
Massachusetts        24?              15                   39
New Jersey           667              33                   83
New York             940              73                  190
Ohio                1611             158                  260
Pennsylvania         618              34                   98
Virginia             174               7                   31
West Virginia         23               2                    4


Exhibit 14:  Regression Data


We seek to estimate Aggregate Value Added (Y) as a function of
Aggregate Capital Service Flow ($X_1$) and Aggregate Man-Hours Worked ($X_2$).
This functional relationship is known in economics as the Cobb-Douglas
production function.

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The  depeadent  variable,  y,  is 


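$$y = \begin{pmatrix} 690 \\ 29 \\ 415 \\ 24? \\ 667 \\ 940 \\ 1611 \\ 618 \\ 174 \\ 23 \end{pmatrix}$$


The 2 independent variables (p = 2) are

$$x_1 = \begin{pmatrix} 39 \\ 2 \\ 18 \\ 15 \\ 33 \\ 73 \\ 158 \\ 34 \\ 7 \\ 2 \end{pmatrix}, \qquad
x_2 = \begin{pmatrix} 124 \\ 6 \\ 69 \\ 39 \\ 83 \\ 190 \\ 260 \\ 98 \\ 31 \\ 4 \end{pmatrix}$$

The elements in $x_1$ are denoted $x_{i1}$, i = 1,...,10, and the elements in
$x_2$ are denoted $x_{i2}$, i = 1,...,10.  Hence, $x_{11} = 39$, $x_{21} = 2$,
$x_{10,1} = 2$, $x_{12} = 124$, $x_{10,2} = 4$.


Suppose we place the vectors $x_1$ and $x_2$ side by side, and label
this "entity" X.  We have:

$$X = \begin{pmatrix} 39 & 124 \\ 2 & 6 \\ 18 & 69 \\ 15 & 39 \\ 33 & 83 \\ 73 & 190 \\ 158 & 260 \\ 34 & 98 \\ 7 & 31 \\ 2 & 4 \end{pmatrix}$$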
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We  call  X  a  matrix  (plural:    ^atrlees) .    A  inatrlx  is  merely  a 
collection of p vectors.  It is symbolized by a capital letter underscored
with a tilde.  A matrix is a 2 dimensional quantity, characterized
by first dimension = number of rows and second dimension =
number of columns.  Our data matrix X has dimensions 10 and 2, and
is a (10 x 2) matrix.  Note that a matrix with only one column is a
vector.  If N = p, X is called square; otherwise it is rectangular.

In general, the data matrix X of independent variables will have
dimensions N and p.  The elements of X are $x_{ij}$ where i = 1, 2, ..., N,
and j = 1, 2, ..., p.  In multiple regression, a column of X is a
single variable, $x_j$, and a row of X is a single observation, a
multivariable observation.  The observations in the data matrix formed
from Exhibit 14 refer to the 10 eastern states.  On each observation
(state) we record capital service flow ($X_1$) and man-hours worked ($X_2$).
Remember that value added is not part of the X data matrix; it is the
y vector of dependent observations.


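An (N x p) data matrix X is:

$$X = \begin{pmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots &        & \vdots \\
x_{N1} & x_{N2} & x_{N3} & \cdots & x_{Np}
\end{pmatrix}$$

This representation will be used throughout Unit 4.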
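
As an illustration, the data matrix for the ten-state example can be held in Python as a list of rows. The values are read from Exhibit 14; the first capital service flow entry is taken as 39, following the statement in the text that x_11 = 39. The helper `element` is hypothetical and only mirrors the 1-based subscripts used here.

```python
# Columns of the data matrix, read from Exhibit 14 (10 eastern states).
capital = [39, 2, 18, 15, 33, 73, 158, 34, 7, 2]        # X1, million $
man_hours = [124, 6, 69, 39, 83, 190, 260, 98, 31, 4]   # X2, million man-hours

# Place the two column vectors side by side: each row is one observation.
X = [[c, m] for c, m in zip(capital, man_hours)]

n_rows, n_cols = len(X), len(X[0])   # dimensions N and p

def element(X, i, j):
    """Return x_ij using the text's 1-based subscripts (hypothetical helper)."""
    return X[i - 1][j - 1]

print(n_rows, n_cols, element(X, 1, 1), element(X, 10, 2))
```

Value added is deliberately left out of X, since it is the y vector of dependent observations.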
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Section  4.    Matrix  tnaftlpulatlon 
In  this  section  we  define 

(1)  Matrix  addition:    X  +  Y 

(2)  Null  matrix:  Z 

(3)  Matrix  multiplication:    X  Y 

(4)  Matrix transposition:  $X^t$

(5)  Identity matrix:  I

(6)  Matrix inversion:  $X^{-1}$

Matrix addition is a simple operation.  To add 2 matrices X and Y,
they must be of the same dimensions, (N x p).  Let C = X + Y.  If X has
elements $x_{ij}$ and Y has elements $y_{ij}$, then the (i,j) element of C, $c_{ij}$,
equals $x_{ij} + y_{ij}$.  We merely add the corresponding entries of X and Y.
An example illustrates this.  If

$$X = \begin{pmatrix} 3 & 2 & 0 \\ 9 & 1 & 6 \end{pmatrix}, \qquad
Y = \begin{pmatrix} 14 & 17 & 1 \\ 2 & 0 & 9 \end{pmatrix}$$

then

$$C = \begin{pmatrix} 3+14 & 2+17 & 0+1 \\ 9+2 & 1+0 & 6+9 \end{pmatrix}
= \begin{pmatrix} 17 & 19 & 1 \\ 11 & 1 & 15 \end{pmatrix}$$

Subtraction is defined as follows:  If C = X - Y, then C = X + (-Y);
i.e., it is addition of X to the negative of Y.
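
Addition and subtraction can be sketched in Python with matrices stored as lists of rows. The two matrices are the ones recoverable from the worked sums in the example; the function names are illustrative.

```python
def mat_add(X, Y):
    """Entry-by-entry sum; both matrices must have the same dimensions."""
    if len(X) != len(Y) or any(len(r) != len(s) for r, s in zip(X, Y)):
        raise ValueError("matrices must have the same dimensions")
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def mat_neg(Y):
    """Negative of a matrix: change the sign of every entry."""
    return [[-y for y in row] for row in Y]

def mat_sub(X, Y):
    """X - Y defined as X + (-Y)."""
    return mat_add(X, mat_neg(Y))

X = [[3, 2, 0], [9, 1, 6]]
Y = [[14, 17, 1], [2, 0, 9]]
C = mat_add(X, Y)
D = mat_sub(C, Y)   # subtracting Y again recovers X
print(C, D)
```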

The null matrix Z plays a special role in addition.  It is an
(N x p) matrix of zeros:  $z_{ij} = 0$ for all i and j.  If X and Z are
(N x p) matrices, and Z is the null matrix, then

$$X + Z = Z + X = X$$
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Matrix  miltlpllcatlbh  is  slightly  more  complicated  than  matrix 
addition.  It is not a term by term operation of multiplying corresponding
entries!  This is important to remember.  In order to multiply
matrices X and Y we require that the number of columns of X must equal
the number of rows of Y.

If X has dimension (N x p) and Y has dimension (p x q), then the
product XY is an (N x q) matrix, C, whose entries, $c_{ij}$, are obtained by
summing the products formed by multiplying, in order, each entry in the
ith row of X with each corresponding entry in the jth column of Y.
Formally,

$$c_{ij} = \sum_{k=1}^{p} x_{ik}\, y_{kj}$$

Matrix multiplication is defined as multiplying the rows of the matrix
on the left with the columns of the matrix on the right.  In general,
X Y does not equal Y X:  matrix multiplication is not commutative.
An example helps.  Let

$$X = \begin{pmatrix} 2 & 1 & -6 \\ 1 & -3 & 2 \end{pmatrix}, \qquad
Y = \begin{pmatrix} 1 & 0 & -3 & 0 \\ 0 & 4 & 2 & 0 \\ -2 & 1 & 1 & 1 \end{pmatrix}$$

Since the number of columns of X, 3, equals the number of rows of Y,
the operation C = X Y is defined.  C will have dimension (2 x 4).  The
first element of C, $c_{11}$, is formed by summing the products of the
first row of X with the first column of Y:  $c_{11} = 2 \cdot 1 + 1 \cdot 0 + (-6)(-2) = 14$.
$c_{12}$ is formed with the first row of X and the second column of
Y:  $c_{12} = 2 \cdot 0 + 1 \cdot 4 + (-6) \cdot 1 = -2$.  The matrix C is

$$C = \begin{pmatrix}
2\cdot1 + 1\cdot0 + 6\cdot2 & 2\cdot0 + 1\cdot4 - 6\cdot1 & -2\cdot3 + 1\cdot2 - 6\cdot1 & 2\cdot0 + 1\cdot0 - 6\cdot1 \\
1\cdot1 - 3\cdot0 - 2\cdot2 & 1\cdot0 - 3\cdot4 + 2\cdot1 & -1\cdot3 - 3\cdot2 + 2\cdot1 & 1\cdot0 - 3\cdot0 + 2\cdot1
\end{pmatrix}
= \begin{pmatrix} 14 & -2 & -10 & -6 \\ -3 & -10 & -7 & 2 \end{pmatrix}$$
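
The row-by-column rule can be sketched in Python. The entries of X and Y below are an assumption: they are one reading of the worked arithmetic in the example, whose printed matrices are hard to make out, so the result should be checked against the text rather than taken as authoritative. The function name is illustrative.

```python
def mat_mul(X, Y):
    """Product of an (N x p) and a (p x q) matrix, both lists of rows."""
    n, p, q = len(X), len(Y), len(Y[0])
    if any(len(row) != p for row in X):
        raise ValueError("columns of X must equal rows of Y")
    # c_ij = sum over k of x_ik * y_kj
    return [[sum(X[i][k] * Y[k][j] for k in range(p)) for j in range(q)]
            for i in range(n)]

# ASSUMED entries, reconstructed from the worked sums in the example.
X = [[2, 1, -6],
     [1, -3, 2]]
Y = [[1, 0, -3, 0],
     [0, 4, 2, 0],
     [-2, 1, 1, 1]]

C = mat_mul(X, Y)          # (2 x 3) times (3 x 4) gives (2 x 4)
print(C)

# Y X is not even defined here: Y has 4 columns but X has 2 rows.
try:
    mat_mul(Y, X)
    yx_defined = True
except ValueError:
    yx_defined = False
print(yx_defined)
```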

A square matrix X has equal numbers of rows and columns.  It has
dimension (N x N).  A square matrix is symmetric if

$$x_{ij} = x_{ji} \quad \text{for all } i \text{ and } j.$$

For example, the matrix

$$X = \begin{pmatrix} 9 & 6 & -3 & 14 \\ 6 & 3 & 0 & 2 \\ -3 & 0 & 2 & 4 \\ 14 & 2 & 4 & 1 \end{pmatrix}$$

is symmetric.  With a square matrix, we call the terms $x_{ii}$, i = 1, ...,
N, the diagonal of the matrix.  The diagonal of the above matrix is
(9,3,2,1).  Note that the diagonal is not well-defined in a rectangular
matrix.

The transposition, or transpose, of a matrix, $X^t$, is defined as a
reversal of the (i,j) elements with the (j,i) elements.  If $Y = X^t$, then

$$y_{ij} = x_{ji}$$

If X is an (N x p) matrix, then Y is (p x N).  Consider the matrix C
given above.
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r 

If X is a square symmetric matrix, $X = X^t$; i.e., transposition does not
change the matrix.

The matrix $X^tX$ is quite important in regression.  $X^tX$ is the matrix
multiplication of the transpose of an (N x p) data matrix X with the
data matrix.  $X^tX$ is a square matrix of dimension (p x p).  Let $Y = X^t$,
and $C = X^tX = YX$.  C has elements

$$c_{ij} = \sum_{k=1}^{N} y_{ik}\, x_{kj} = \sum_{k=1}^{N} x_{ki}\, x_{kj}$$

since $y_{ik} = x_{ki}$.  The diagonal elements of C, $c_{ii}$, are the sums of the
squares of the columns of X:

$$c_{ii} = \sum_{k=1}^{N} x_{ki}^2$$

The off-diagonal elements, elements with i ≠ j, are the sums of the ith
column of X multiplied by the jth column of X and are called "cross-products".
Note that C is symmetric:

$$c_{ij} = \sum_{k=1}^{N} x_{ki}\, x_{kj} = \sum_{k=1}^{N} x_{kj}\, x_{ki} = c_{ji}$$

The matrix $X^tX$ is called the matrix of sums of squares and cross-products.
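
A short Python sketch shows the sums of squares on the diagonal and the cross-products off the diagonal. The (3 x 2) matrix is hypothetical, and the helper functions are illustrative rather than part of the course material.

```python
def transpose(X):
    """Swap rows and columns: element (i, j) becomes element (j, i)."""
    return [list(col) for col in zip(*X)]

def mat_mul(X, Y):
    """Row-by-column product of conforming matrices stored as lists of rows."""
    p = len(Y)
    return [[sum(row[k] * Y[k][j] for k in range(p)) for j in range(len(Y[0]))]
            for row in X]

X = [[1, 2],
     [3, 4],
     [5, 6]]                      # hypothetical (3 x 2) data matrix

C = mat_mul(transpose(X), X)      # (2 x 3) times (3 x 2) gives (2 x 2)

# Diagonal: sums of squares of each column of X.
sums_of_squares = [sum(row[j] ** 2 for row in X) for j in range(2)]
# Off-diagonal: cross-product of column 1 with column 2.
cross_product = sum(row[0] * row[1] for row in X)

print(C, sums_of_squares, cross_product)
```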

Just as multiplication has a unique identity element, 1, matrix
multiplication has an identity matrix I.  I is a square (p x p) matrix,
with ones on the diagonal, and zeros elsewhere:
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I  - 


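$$\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}$$


Multiplication of an (N x p) matrix X by I yields X:
X I = I X = X.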

Division of matrices is quite complicated.  The process is known
as matrix inversion and is defined only for square matrices.  If Y is
a (p x p) matrix, the inverse of Y is denoted $Y^{-1}$, such that

$$Y\,Y^{-1} = Y^{-1}\,Y = I$$

Inverting a large matrix cannot be done without the aid of a computer.
For small matrices, we have the following result:
If Y is a (2 x 2) matrix, then

$$Y^{-1} = \begin{pmatrix}
y_{22}/(y_{11}y_{22} - y_{12}y_{21}) & -y_{12}/(y_{11}y_{22} - y_{12}y_{21}) \\
-y_{21}/(y_{11}y_{22} - y_{12}y_{21}) & y_{11}/(y_{11}y_{22} - y_{12}y_{21})
\end{pmatrix}$$


Determining the inverse of the matrix $X^tX$, $(X^tX)^{-1}$, is the "key
computation" in multiple regression.
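
The (2 x 2) result can be checked numerically. In this Python sketch the matrix is hypothetical, and the guard against a zero denominator is an addition: the formula only applies when y11*y22 - y12*y21 is not zero.

```python
def inverse_2x2(Y):
    """Inverse of a (2 x 2) matrix by the closed-form result."""
    (y11, y12), (y21, y22) = Y
    det = y11 * y22 - y12 * y21          # common denominator
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    return [[y22 / det, -y12 / det],
            [-y21 / det, y11 / det]]

def mat_mul(X, Y):
    """Row-by-column product of conforming matrices stored as lists of rows."""
    return [[sum(row[k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for row in X]

Y = [[4, 7],
     [2, 6]]                  # hypothetical matrix
Y_inv = inverse_2x2(Y)
product = mat_mul(Y, Y_inv)   # should be the identity, up to rounding
print(Y_inv, product)
```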
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Homev"jrk 
Prerequisite Inventory, Unit 4

Let:


State whether the following operations are valid, and if so, compute
the resulting matrix.

(1)  A + D

(2)  A C

(3)  $A^t$

(4)  $C^tC$

(5)  A + B

(6)  $C^{-1}C$

(7)  $D^{-1}$

(8)  $B^{-1}$

(9)  $A^tB$

(10)  A + I

(11)  If Z is a (5 x 5) null matrix, what is ZI?

(12)  What is the diagonal of the matrix C given above?

(13)  Are the off diagonal terms of the matrix B +       well defined?

(14)  Are any of the above matrices symmetric?

(15)  Prove that for any square matrix F, F + $F^t$ is symmetric.

(16)  Compute:  $A^tA$.
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 Homework  Solutions 

Prerequisite  Inventory,  Unit  4 


"(i  H)  (M  n  .  fi^  1  m 
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10.    Invalid  bperatibn^  the  identity  Datrix^  ti  is  a  square  natrlXi 

(that  iSi  ^) 


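11.  ZI = Z, the (5 x 5) null matrix:

$$\begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}$$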

12. 


»11  "  ^ 

^22  *  1 
X33  -  1 


13.  No, because only in a square matrix is the diagonal well defined.

14.  Yes,  e 


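15.  Let F be the (n x n) matrix with elements $a_{ij}$:

$$F = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}, \qquad
F^t = \begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{n1} \\
a_{12} & a_{22} & \cdots & a_{n2} \\
\vdots & \vdots &        & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{nn}
\end{pmatrix}$$

$$F + F^t = \begin{pmatrix}
a_{11} + a_{11} & a_{12} + a_{21} & \cdots & a_{1n} + a_{n1} \\
a_{21} + a_{12} & a_{22} + a_{22} & \cdots & a_{2n} + a_{n2} \\
\vdots & \vdots &        & \vdots \\
a_{n1} + a_{1n} & a_{n2} + a_{2n} & \cdots & a_{nn} + a_{nn}
\end{pmatrix}$$

In $F + F^t$, the (i,j) element is $a_{ij} + a_{ji}$, which is equal to the
(j,i) element, $a_{ji} + a_{ij}$.


16.  $$A^tA = \begin{pmatrix} 26 & 40 & 12 \\ 40 & 90 & 59 \\ 12 & 59 & 91 \end{pmatrix}$$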
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Lecture  4-a    latroductibri  to  Unit  4 
Introduction to Unit 4, Multiple Regression

Lecture Content:

1.  Introduction to objectives, problem, and notation of Unit 4

2.  Introduction to the geometric representation of multiple
regression

Main Topics:

1.  Specific introduction to objectives of Unit 4

2.  Notation for Unit 4

3.  Introduction to general problem of Unit 4
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Topic  1.    Specific  Infrbductidn  to  Objectives  of  Unit  4: 

I.    Questions to be answered in Unit 4 (1)

1.  What is an $(x_{ij}, y_i)$ observational batch?

a.  Data set consisting of p+1 batches each containing
N observations

b.  Data set containing p independent or x variables and
1 dependent or y variable

c.  The ith observation of the y batch, $y_i$, is associated
with the ith observation of each of the p x batches

d.  We have, thus, a batch of N associated observations
on p+1 variables

2.  What analyses can be done on a batch of 1 y and multiple
x variable data?

a.  What kind of summary can we use to describe the data?

ans:  Express conditional typical y as linear function
of x's

b.  How do we estimate fit?

ans:  Use least squares in multiple regression

c.  How do we determine whether transformations would
improve the summary?

ans:  Examine individual x,y batches

d.  How do we adjust summarization to handle special
situations in x's?

ans:  Indicator (dummy) variables, splines,
interactions, quadratic terms

e.  How do we judge whether the summary summarizes the
data effectively?

ans:  Inference on least squares

f.  How do we judge whether the individual x variables
are related to the y variable in important ways?

ans:  (t statistic)  Inference on coefficients


g.  How can we determine whether our evaluation of the ...

ans:  Study residuals

h.  How can we determine whether the fitting procedure
itself is appropriate for the data?
Topic 2. Notation

1. The ith observations on each of the p + 1 variables are        (2)
associated. Thus we can represent them as an arrangement and
a point in p + 1 space:

$(x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip}, y_i)$

We adopt the convention that the first subscript indicates the
observation, the second indicates the variable.

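2. Piling these arrangements on top of one another and separating
the independent from the dependent variable yields an N x p
matrix of x values (N observations by p variables) and a column
vector, N x 1, of y values:

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Np}
\end{pmatrix}
\qquad \text{and} \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
$$

3. In matrix notation this can be written

X and y,
where X is N x p and y is N x 1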

(Recall  that  upper  case  letters  denote  matrices  and  lower  case 
denote  vectors.) 

(Note also that X may contain a column of 1s for least squares.)

4. We will be studying equations with multiple x's. Use $b_j$ notation
for coefficients.

Thus $y = a + bx$ is written $y = b_0 + b_1 x$,

and generally,

$y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$

We refer to the $b_j$ as coefficients or parameters; $b_0$ is the
"constant" term. The equations are linear in the coefficients.
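The notation above can be made concrete with a small sketch. The numbers below are invented purely for illustration (they are not course data), and the names `observations` and `X_with_constant` are hypothetical; the point is only to show the N x p matrix, the N x 1 vector, and the optional column of 1s.

```python
# Hypothetical observations: each tuple is (x_i1, x_i2, y_i), so p = 2 and N = 4.
observations = [
    (1.0, 10.0, 3.5),
    (2.0, 8.0, 4.1),
    (3.0, 7.0, 5.0),
    (4.0, 3.0, 6.2),
]

# Separate the independent variables from the dependent variable.
X = [list(obs[:-1]) for obs in observations]  # N x p matrix of x values
y = [obs[-1] for obs in observations]         # N x 1 vector of y values

# For least squares with a constant term b_0, prepend a column of 1s.
X_with_constant = [[1.0] + row for row in X]

N, p = len(X), len(X[0])
print(f"X is {N} x {p}; y has {len(y)} entries")
print(f"with the column of 1s, X is {N} x {len(X_with_constant[0])}")
```

The first subscript (row) is the observation and the second (column) is the variable, matching the convention adopted in item 1.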
Topic 3. Introduction to general problem of Unit 4

1. What are some examples of one Y and multiple X data?

a. Cost of accident and speed, weight of car, age of driver.

Y is cost in dollars (continuous)
$X_1$ is speed in mph (continuous)
$X_2$ is weight in pounds (continuous)
$X_3$ is age in years (continuous)

Q: What is typical cost of an accident given speed,
weight, age?

What are the marginal effects of speed, weight and
age on the typical cost?

b. IQ scores and average outlay per pupil in a school system, age
of pupil, number of siblings, birth order, sex.

Y is IQ score (continuous)
$X_1$ is outlay in dollars (continuous)
$X_2$ is age in years (continuous)
$X_3$ is sibling count (discrete)
$X_4$ is birth order (discrete)
$X_5$ is sex (0/1) (indicator)

Q: What is typical IQ score given: outlay, age, siblings,
order, sex?

What are their marginal effects?

Are all important?

c. Median years of education in a Pittsburgh census tract
and population density, median age, percent poor, percent
nonwhite.

Y is education
$X_1$ is population density
$X_2$ is age
$X_3$ is poverty
$X_4$ is nonwhite
($X_5$ is poverty x nonwhite) interaction
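2. How do we construct the summary?

Data = Fit + Residual
     = Conditional Typical + Residual
     = $C(y \mid X_1, X_2, \ldots, X_p)$ + Residual

We assume:

$C(y \mid X_1, X_2, \ldots, X_p) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$

and for the ith observation:

$C(y_i \mid X_{i1}, X_{i2}, \ldots, X_{ip}) = \hat{y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_p X_{ip}$

thus:

$y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_p X_{ip} + \text{residual}_i$

The equation is linear since exponents of the bs are all
1. The Xs may have any exponents. Just read any power of X as Z.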


3. An example: Nations data
a. Representation

Y = life expectancy
$X_1$ = per capita income
$X_2$ = infant mortality

We have for each nation (99 observations)

$\hat{y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2}$

Pile on top of one another

$\hat{y}_1 = b_0(1) + b_1 X_{11} + b_2 X_{12}$

or in matrix notation

$$
\hat{y} = Xb, \qquad
X = \begin{pmatrix}
1 & X_{11} & X_{12} \\
1 & X_{21} & X_{22} \\
\vdots & \vdots & \vdots \\
1 & X_{99,1} & X_{99,2}
\end{pmatrix}
$$

b. Visualization

Equation $\hat{y} = b_0 + b_1 X_1 + b_2 X_2$ is plane in 3-space

c. What does summary involve geometrically?

i. Simplified case: X variables High or Low

Schematize
Display in 3-space
Fit summary -- Connect medians?
Fit plane?

ii. General situation: continuous data

point cloud (plotting)
consider solid
fit plane
equation $\hat{y}$
residuals $y - \hat{y}$
Interpretation

How do we choose the plane?

What about more X's?

$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots$

residuals from $\hat{y}$
Lecture 4-0
Transparency Presentation Guide

Lecture Outline     Transparency
Location            Number          Transparency Description

Topic 1
Section A
1.                                  Questions for Unit 4
1.                                  Data Representation

Topic 3
Section
3.b                 overlay 3       Plane defined by X_1 and X_2
                                    Schematic plots
c.i                 overlay 4       Plane connecting conditional typicals
c.ii                overlay 5       Point cloud
c.ii                overlay 6       Ellipsoid defined by point cloud
c.ii                overlay 7       Regression plane passing through ellipsoid
c.ii                overlay 8       Residuals from ellipsoid to plane
Module II

Lecture 4-1. Multiple Regression Using Least Squares
Multiple Regression Using Least Squares: Algebraic Computations

Lecture Content:

1. The Model

2. Least Squares Estimation

Main Topics:

1. Algebraic Representation of the Model

2. Matrix Version

3. Least Squares Solution -- General

4. Least Squares Solution -- Univariate

5.  Examples  of  Computer  Generated  Fits 


(There  are  no  transparencies  for  this  lecture.  Material  should  be 
developed  on  blackboard.) 


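Topic 1. The Model

I. Basic Issue -- Presentation of Model

1. General case: N observations, p variables

2. ith Equation

$\hat{y}_i = C(y_i \mid X_{i1}, X_{i2}, \ldots, X_{ip}) = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_p X_{ip}$

3. Note that this equation is linear in $b_j$

4. Matrix notation

We can "stack" the $\hat{y}_i$ into a vector and the $X_{ij}$ into a
matrix. Remember, there are N rows.

$$
\begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{pmatrix}
=
\begin{pmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1p} \\
1 & X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{N1} & X_{N2} & \cdots & X_{Np}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{pmatrix}
$$

or $\hat{y} = Xb$

II. Problem -- This is only a conceptual model

1. It defines a surface in p+1 dimensions

2. General equation for surface:

$C(y \mid X_1, X_2, \ldots, X_p) = b_0 + b_1 X_1 + \cdots + b_p X_p$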
3. The actual data points ($y_i$'s) do not lie exactly on it

4. How do we choose the b's such that the surface is a
reasonable summary of the point cloud?

5. We also desire the $\hat{y}_i$'s to be "reasonable" typical values
of $y_i$ given $X_i$.
Topic 2. Least Squares Estimation

I. Basic Issue -- Minimize sum of squared residuals

1. Choose b so that

$\sum (y_i - \hat{y}_i)^2$ is minimized

2. Solution:

$b = (X'X)^{-1} X'y$

3. $X'X$ must be "non-singular"

4. $\hat{y} = Xb$

II. Solution -- Least squares calculations

1. $(X'X)$ = symmetric matrix of sums of squares and cross-products
of the X's, with entries $\sum_i X_{ij} X_{ik}$

2. $(X'y)$ = vector of cross-products of the X's with y, with
entries $\sum_i X_{ij} y_i$, j = 1, ..., p

3. These calculations are straightforward computationally

4. Difficult task is inverting $(X'X)$
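The calculations above can be sketched in a few lines. This is an illustrative implementation only, using Gauss-Jordan elimination on the normal equations $(X'X)b = X'y$ rather than forming the inverse explicitly; the function name `least_squares` and the data are made up for the example, and production work would normally rely on a numerical library.

```python
def least_squares(X, y):
    """Return b solving the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]: sums of squares and cross-products.
    a = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for c in range(p):
        # Partial pivoting: move the largest remaining entry of column c up.
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            raise ValueError("X'X is singular; b cannot be computed")
        a[c], a[pivot] = a[pivot], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for r in range(p):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [row[p] for row in a]


# Made-up data generated exactly from y = 2 + 3*x1 - 1*x2.
# The first column of X is the column of 1s for the constant term.
X = [[1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 3, 1], [1, 4, 5]]
y = [2 + 3 * x1 - 1 * x2 for _, x1, x2 in X]

b = least_squares(X, y)
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
print([round(bj, 6) for bj in b])
```

Because the made-up y values lie exactly on a plane, the recovered coefficients match the generating ones and every residual is zero up to rounding.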


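III. Method -- Univariate Situation (p = 1)

1. $X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{pmatrix}$, $\quad b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$

2. $C(y \mid x) = \hat{y} = Xb$

3. To determine b and hence calculate $\hat{y}$, we must compute

$b = (X'X)^{-1} X'y$

4. Calculation

a. First evaluate

$$
X'X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_N \end{pmatrix}
\begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_N \end{pmatrix}
= \begin{pmatrix} N & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix}
$$

b. Secondly, evaluate

$$
X'y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_N \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}
= \begin{pmatrix} \sum y_i \\ \sum X_i y_i \end{pmatrix}
$$

c. Thirdly, we must compute $(X'X)^{-1}$

i. For $M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$,
$\quad M^{-1} = \begin{pmatrix} d/(ad-bc) & -b/(ad-bc) \\ -c/(ad-bc) & a/(ad-bc) \end{pmatrix}$

ii. But $X'X$ is symmetric, and b = c

iii. Hence $M^{-1} = \dfrac{1}{ad - b^2} \begin{pmatrix} d & -b \\ -b & a \end{pmatrix}$

iv. Note: If $ad - b^2 \approx 0$, $M^{-1}$ cannot be computed reliably
v. This occurs when the X's are nearly constant
vi. We have

a = N
b = $\sum X_i$
d = $\sum X_i^2$

$ad - b^2 = N \sum X_i^2 - (\sum X_i)^2 = N \sum (X_i - \bar{X})^2$

vii. Hence

$$
(X'X)^{-1} = \frac{1}{N \sum (X_i - \bar{X})^2}
\begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & N \end{pmatrix}
$$

d. Lastly,

$$
b = (X'X)^{-1} X'y = \frac{1}{N \sum (X_i - \bar{X})^2}
\begin{pmatrix} \sum X_i^2 \sum y_i - \sum X_i \sum X_i y_i \\ N \sum X_i y_i - \sum X_i \sum y_i \end{pmatrix}
$$

e. Thus,

$$
b_0 = \frac{\sum X_i^2 \sum y_i - \sum X_i y_i \sum X_i}{N \sum (X_i - \bar{X})^2}
    = \frac{\bar{y} \sum X_i^2 - \bar{X} \sum X_i y_i}{\sum (X_i - \bar{X})^2}
$$

$$
b_1 = \frac{N \sum X_i y_i - \sum X_i \sum y_i}{N \sum (X_i - \bar{X})^2}
    = \frac{\sum (X_i - \bar{X})(y_i - \bar{y})}{\sum (X_i - \bar{X})^2}
    = \frac{\mathrm{Cov}(X, y)}{\mathrm{Var}(X)}
$$

$b_0 = \bar{y} - b_1 \bar{X}$

5. If p > 1, solution is more complicated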
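As a quick check on the univariate formulas, the sketch below computes $b_0$ and $b_1$ two ways: from the deviation form $b_1 = \sum (X_i - \bar{X})(y_i - \bar{y}) / \sum (X_i - \bar{X})^2$ and from the raw sums that come out of the 2 x 2 inverse. The data and the function name `simple_least_squares` are invented for illustration.

```python
def simple_least_squares(x, y):
    """Closed-form b0, b1 for the line y = b0 + b1*x (deviation form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all X values are equal, so X'X cannot be inverted")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = simple_least_squares(x, y)

# Same estimates from the raw sums a = N, b = sum(X), d = sum(X^2).
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
det = n * sum_xx - sum_x ** 2          # ad - b^2
b0_raw = (sum_xx * sum_y - sum_x * sum_xy) / det
b1_raw = (n * sum_xy - sum_x * sum_y) / det

print(round(b0, 4), round(b1, 4))
```

Both routes give the same line; the deviation form is usually preferred by hand because the determinant $ad - b^2$ appears directly as $N \sum (X_i - \bar{X})^2$.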
Examples

a. Life Expectancy ~ Per capita income + infant mortality

i. $LE = b_0 + b_1\,PCI + b_2\,IM$
   $\hat{y} = b_0 + b_1 X_1 + b_2 X_2$

ii. Income in $
    Mortality in deaths/1000 live births
    Life expectancy in years
    N = 99

iii. Model

$\hat{y}_i = 53.36 + .005 X_{i1} - .058 X_{i2}$

b. Interpretation

i. Typical increment in life expectancy for $100
increment in income is .5 years (100 x .005)

ii. Typical decrement in life expectancy for 100
infant deaths is 5.8 years (100 x .058)

iii. Relate these results to batches and fitted plane

c. Is this interpretation similar to that obtained via
2 separate regressions?

No:

$\hat{y} = 46.88 + .007 X_1$
$\hat{y} = 61.51 - .086 X_2$

Only true when $\mathrm{Cov}(X_1, X_2) = 0$
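The point about separate regressions can be illustrated numerically. The sketch below uses invented data (not the nations data) generated exactly from $y = 1 + 2X_1 - 3X_2$, and compares the joint two-variable slopes with the slopes from two separate one-variable fits, once with correlated X's and once with X's whose covariance is zero. The helper names are hypothetical.

```python
def centered_sum(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def joint_slopes(x1, x2, y):
    """Slopes b1, b2 of y = b0 + b1*x1 + b2*x2 from centered sums."""
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def separate_slope(x, y):
    """Slope of a one-variable regression of y on x."""
    return centered_sum(x, y) / centered_sum(x, x)


x1 = [0, 1, 2, 3, 4, 5]
x2_correlated = [0, 1, 1, 2, 3, 3]      # moves together with x1
x2_uncorrelated = [1, 0, 0, 0, 0, 1]    # chosen so that Cov(x1, x2) = 0

results = {}
for label, x2 in [("correlated", x2_correlated), ("uncorrelated", x2_uncorrelated)]:
    y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]
    results[label] = {
        "joint": joint_slopes(x1, x2, y),
        "separate": (separate_slope(x1, y), separate_slope(x2, y)),
    }
    print(label, results[label])
```

With correlated X's the separate slope on $X_1$ absorbs part of the $X_2$ effect; with zero covariance the separate and joint slopes coincide.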
Lecture 4-2. Transformations

Using least squares procedures to estimate alternative functional forms
for the conditional typical summary: Part I -- Transformations

Lecture Content:

1. Introduction -- Data Analysis and Theory Testing

2. Transformations

Main Topics:

1. Purposes of transformations

2. Constructing models with transformed variables

3. Interpreting transformed models


(There  are  no  transparencies  for  this  lecture.) 
Topic 1: Introduction to using transformed variables in multiple linear
regression

I. Basic Issue: Estimate b in $\hat{y} = Xb$ when the exponent
may vary across $X_j$

1. Recall $\hat{y} = Xb$ is linear in b and is conceptual model

2. We may want to summarize the data with

$C(y \mid Z_1, Z_2, \ldots, Z_p) = b_0 + b_1 Z_1^{r_1} + b_2 Z_2^{r_2} + \cdots + b_p Z_p^{r_p}$

we can set $Z_j^{r_j} = X_j$

and get

$C(y \mid X_1, X_2, \ldots, X_p) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$

Now we can use least squares to estimate b.

3. We may want to summarize the data with

$\hat{y} = C(y \mid X_1, X_2, \ldots) = X_1^{b_1} \cdot X_2^{b_2}$

We need transformation to linearize. Suppose we need a log
transformation. Then

$\log \hat{y} = \log C(y \mid X_1, X_2) = b_1 \log X_1 + b_2 \log X_2$, when p = 2.

This is now linear in b.

II. Distinction between testing theory and doing data analysis

1. Theory may impose functional form for conditional typical

a. Distance in free fall: ($V_0 = 0$)

$D = \tfrac{1}{2} g t^2$

Then

$\log D = \log .5 + \log g + 2 \log t$

But set: $y = \log D$, $X_1 = \log g$, $X_2 = \log t$,

$b_0 = \log .5$, $b_1 = 1$, $b_2 = 2$

Thus, $\hat{y} = b_0 + b_1 X_1 + b_2 X_2$

Analysis involves determining whether empirical data
yield coefficients that are close to theoretical
prediction.
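A small sketch of the free-fall check follows. It assumes noise-free, made-up measurements at a single location, so g is a constant; in that case log g cannot be separated from log .5 and both merge into the intercept, leaving the slope on log t as the quantity to compare with the theoretical value 2. The value 9.8 and the measurement times are assumptions for the example.

```python
import math

g = 9.8                                        # assumed acceleration, m/s^2
times = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]         # hypothetical times, seconds
distances = [0.5 * g * t ** 2 for t in times]  # D = (1/2) g t^2, no noise

# Transform to the log scale: y = log D, x = log t.
x = [math.log(t) for t in times]
y = [math.log(d) for d in distances]

# One-variable least squares on the transformed data.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Untransform the intercept: intercept = log(.5) + log(g), so g = 2*exp(intercept).
g_recovered = 2 * math.exp(intercept)
print(round(slope, 6), round(g_recovered, 6))
```

With real measurements the slope would not equal 2 exactly, and the analysis would ask whether it is close enough to the theoretical prediction.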

b. Liquidity preference function

where I is interest rate
quantity of money
L' constant "transactions demand for money"
L'' is speculative demand for money
is minimum level of interest

Function is

Graph is rectangular hyperbola with curvature depending
on $b_1$.

We get

$I = b_0 + b_1 X$ as model for conditional typical
we have

Go backwards, i.e., untransform once $b_0$ and $b_1$ are
estimated from data.
c. Several possible forms: government bureaucracy and
population size

$B = C P^{b}$ (B is bureaucracy, P is population)

b = 1   constant proportion between B and P
b < 1   economies of scale
b > 1   Parkinson's law

log both sides

$\log B = \log C + b_1 \log P$

let $\log B = y$, $\log C = b_0$, $\log P = X$

$\hat{y} = b_0 + b_1 X$

2. Data may require exploratory analysis to find scales
(dimensions) for variables: do each variable pair ($X_j$, y)
individually.

a. Life expectancy vs. infant mortality and per capita
income

Model                    b_0      b_1      b_2      r^2
LE = pci                46.88    .007               .52
     Mort               61.51             -.086     .35
     pci + Mort         53.36    .005     -.058     .65
     log pci             1.30    .158               .67
     log Mort            2.08            -2.04      .59
     log pci + log Mort  1.62    .106     -.102     .75

3. Distinction between variable and regressor (carrier). May
have fewer independent variables than terms in equation.
X may include functions of X and cross products of $X_j$

a. Polynomial: $X, X^2, \ldots, X^i$; i is "order"

b. Cross products: $X_i X_j$, $X_i X_j X_k$, $X_i^2 X_j$, etc.
c. Interpretation of cross product terms as "Interactions"

i. Mathematical

$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2$ has varying slope:
the slope with respect to $X_1$ is $b_1 + b_3 X_2$, so it depends
on the level of $X_2$ ("different slopes for different folks")

ii. Substantive -- multiplicative effect

Cells defined by $X_1$ and $X_2$ each being low (<) or high (>):

Cell        Additive     Multiplicative
I           < + <        < . <
II, III     < + >        < . >
IV          > + >        > . >

Topic 2. Transformations

I. $X^r$ with r positive integer

1. First order models: r = 1

2. Second order models: $r_j$ = 1 or 2

a. $\hat{y} = b_0 + b_1 X_1 + b_2 X_1^2$
Set $X_1^2 = Z$

$\Rightarrow \hat{y} = b_0 + b_1 X_1 + b_2 Z$

(One variable; two regressors. Use OLS)

b. $\hat{y} = b_0 + b_1 X_1 + b_2 X_1^2 + b_3 X_2 + b_4 X_2^2 + b_5 X_1 X_2$
Set $X_1^2 = Z_1$, $X_2^2 = Z_2$, $X_1 X_2 = Z_3$

Get $\hat{y} = b_0 + b_1 X_1 + b_2 Z_1 + b_3 X_2 + b_4 Z_2 + b_5 Z_3$

(Two variables: $X_1, X_2$; five regressors: $X_1, X_2, X_1^2, X_2^2, X_1 X_2$)

3. Third order models, etc. Involve 3 or more Xs multiplied
(can be $X_i X_j X_k$, $X_i^2 X_j$, $X_i^3$, etc.)

4. Note: forms may be suggested by theory or residuals

II. r not a positive integer

1. Reciprocal r = -1

$\hat{y} = b_0 + b_1 \frac{1}{X_1} + b_2 \frac{1}{X_2} + \cdots$
$Z_1 = \frac{1}{X_1}$, $Z_2 = \frac{1}{X_2}$, ...

Get $\hat{y} = b_0 + b_1 Z_1 + b_2 Z_2 + \cdots$

2. Logarithmic

$\hat{y} = b_0 + b_1 \log X_1 + b_2 \log X_2 + \cdots$

4. Square root

$\hat{y} = b_0 + b_1 X_1^{1/2} + b_2 X_2^{1/2} + \cdots$

III. In general

1. $\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots$

2. $\hat{y} = b_0 + b_1 X_1^{r_1} + b_2 X_2^{r_2} + \cdots$

3. Use exploratory tools or theory to find each r.
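To make the variable/regressor distinction concrete, here is a minimal sketch that turns two variables into the five regressors of a second order model (plus the constant). The function name `second_order_regressors` and the data are hypothetical; the resulting rows could be handed to any ordinary least squares routine.

```python
def second_order_regressors(x1, x2):
    """One design-matrix row: constant, X1, X1^2, X2, X2^2, X1*X2."""
    return [1.0, x1, x1 ** 2, x2, x2 ** 2, x1 * x2]


# Made-up (X1, X2) pairs: two variables per observation.
data = [(1.0, 2.0), (2.0, 0.5), (3.0, 4.0)]

# Each row now holds five regressors plus the constant,
# and the model is linear in b even though it is quadratic in the X's.
design = [second_order_regressors(x1, x2) for x1, x2 in data]
for row in design:
    print(row)
```

The transformation happens entirely in the construction of the data; the least squares step itself is unchanged.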
Topic 3. Interpreting transformations -- Some transformations have useful
substantive or mathematical interpretations

I. Univariate case -- logarithms

Four cases:

                                Independent Variable X
                                Not Logged      Logged
Dependent       Not Logged      I               III
Variable Y      Logged          II              IV

1. Case I  Both dependent and independent variables linear.

Conceptual model: $y = b_0 + b_1 x$
Differentiate both sides with respect to x:

$\frac{dy}{dx} = b_1$

Thus, $b_1$, the slope of the line, is the amount by which y
changes for a unit change in x.

2. Case II  Dependent variable logged, independent variable linear.
(Called "log-linear.")

Conceptual model: $y = b_0 e^{b_1 x}$

log version (taking logs of both sides)

$\log y = \log b_0 + b_1 x$

Differentiate both sides with respect to x:

$\frac{1}{y} \frac{dy}{dx} = b_1$

But the left side of this is the ratio of the proportionate
change in y to a unit change in x:

$\frac{dy / y}{dx} = b_1$

So $b_1$ can be interpreted as the proportion y changes for
a unit change in x, or 100 x $b_1$ gives the percentage change
in y for a unit change in x.

3. Case III  Dependent variable linear, independent variable logged.
(Called "linear-log.")

Conceptual model: $e^y = b_0 x^{b_1}$
log version:

$y = \log b_0 + b_1 \log x$

Differentiate both sides with respect to x:

$\frac{dy}{dx} = \frac{b_1}{x}$

Multiply both sides by x:

$x \frac{dy}{dx} = b_1$

But the left hand side is the same as

$\frac{dy}{dx / x} = b_1$

So $b_1$ can be interpreted as the ratio of the amount y changes
to a proportionate change in x. Thus $b_1/100$ can be read as
the approximate amount y changes when x increases by 1%
(using natural logs).

4. Case IV  Both dependent and independent variables logged (Called
"log-log.")

Conceptual model: $y = b_0 x^{b_1}$
log version:

$\log y = \log b_0 + b_1 \log x$

Differentiate with respect to x:

$\frac{1}{y} \frac{dy}{dx} = \frac{b_1}{x}$

Multiply both sides by x:

$\frac{x}{y} \frac{dy}{dx} = b_1$

But left side is ratio of proportionate change in y to
proportionate change in x:

$\frac{dy / y}{dx / x} = b_1$

Or it can be read as the ratio of the percentage change in y
for a percentage change in x -- but this is elasticity. So $b_1$
can be interpreted as the elasticity of y with respect to x.
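The interpretations of $b_1$ in the logged cases can be checked numerically. The sketch below assumes natural logarithms and arbitrary, made-up values $b_0 = 2$ and $b_1 = 0.5$; the function names are hypothetical. It compares small finite changes with the derivative-based readings of $b_1$.

```python
import math

b0, b1 = 2.0, 0.5   # hypothetical coefficients
x = 10.0            # point at which the changes are evaluated


def case2(v):
    """Log-linear: y = b0 * exp(b1 * x)."""
    return b0 * math.exp(b1 * v)


def case3(v):
    """Linear-log: y = log(b0) + b1 * log(x)."""
    return math.log(b0) + b1 * math.log(v)


def case4(v):
    """Log-log: y = b0 * x ** b1."""
    return b0 * v ** b1


# Case II: proportionate change in y per unit change in x is about b1.
dx = 0.001
prop_change_case2 = (case2(x + dx) - case2(x)) / case2(x) / dx

# Case III: a 1% increase in x changes y by about b1 / 100.
change_case3 = case3(1.01 * x) - case3(x)

# Case IV: percentage change in y per 1% change in x is about b1 (elasticity).
elasticity_case4 = ((case4(1.01 * x) - case4(x)) / case4(x)) / 0.01

print(round(prop_change_case2, 4), round(change_case3, 5), round(elasticity_case4, 4))
```

The agreement is approximate because the interpretations come from derivatives, while the sketch uses small but finite changes.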

II. Multiple X situations

1. Exponential model

$y = e^{b_0 + b_1 X_1 + b_2 X_2 + \cdots}$
$\log y = b_0 + b_1 X_1 + b_2 X_2 + \cdots$

2. Reciprocal

$y = \dfrac{1}{b_0 + b_1 X_1 + b_2 X_2 + \cdots}$

$\dfrac{1}{y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots$

III. Notes:

1. The least squares estimates apply to the transformed
models only

2. Avoid transforming y if possible. This may have
consequences for inference

3. Discuss problems in expanding the number of parameters
to fit the data. Issues of parsimony, complexity, and the
substantive context of the problem

4. Always redefine variables as variables, parameters as
parameters


Module II


Lecture 4-3. Indicator Variables

Using least squares procedures to estimate alternative functional forms
for the conditional typical summary: Part II -- Indicator Variables        (1)

Lecture Content:

1. Constructing variables and data sets for indicator variables

2. Interpreting models containing indicator variables

Main Topics:

1. Introduction to indicator variables

2. Simple 0/1 indicator variables

3. Linear and other functional forms for indicators

4. Splines -- Shifts in intercept and slope

Tools introduced:

1. Indicator Variables

2. Splines
Topic 1: Introduction to indicator variables

I. Basic issue: Effective summary of data may require the
construction of "variables" and "data" to yield an appropriate
form for the conditional typical for certain data
sets

1. An X may be categorical rather than continuous.
Example: Race, sex, region, season

2. An X may take a known or hypothesized functional form
but data may be lacking.

Example: Ordered categories such as income, educational
attainment

3. Time trends may be suspected

Example: Growth in population, sales volume, salaries,
property values, inflation rate

4. Curves, cyclic behavior, or other consistent changes in
intercept and slope may be evident. Example: Admittance
volume for emergencies in a hospital, number of enrolled
participants in training programs


II. How can we construct alternative functional forms for the
conditional typical which will permit us to use least
squares estimation procedures in these special situations?

1. Use indicator (dummy or switching) variable which takes
on value of 1 when special condition holds and is 0
otherwise for categorical variables

2. Use indicator variables which take on linear or other
forms (quadratic, etc.) when trends or functional forms
are expected and slope is different

3. Use linear splines (connected or disconnected) to track
a special curve where shifts in intercept and slope are
expected
Topic 2: Indicator variables which give rise to shifts in the intercept

I. Categorical variables only (for heuristic purposes)

The typical value may depend upon the state of the categorical
variable

1. Dichotomous -- 0/1: Two levels or states

a. Example: Life expectancy for industrial and
nonindustrial nations

b. We can construct a model for the conditional typical        (2)
(mean) as follows:

$C(LE \mid \text{National status}) = \hat{y} = b_0 + b_1 X_1$

where $X_1$ = 1 if nation is industrial,
              0 otherwise

Interpretation:

Effect of model is to estimate:

$\hat{y} = b_0$ when nation is not industrial

$\hat{y} = b_0 + b_1$ when nation is industrial

I.e., when not industrial, model estimates typical
as line horizontal to X axis with intercept = $b_0$.
When industrial, typical, conditional on being
industrial, has intercept $b_0 + b_1$. The mean of
the nonindustrial will be $b_0$; of the industrial,
$b_0 + b_1$. The value of $b_1$ indicates how different
the two groups are.        (3)

Thus X is the variable that indicates which
category of nation is being considered.

c. Can estimate model in b. using OLS:

$\hat{y} = 49.49 + 22.18 X$,   $r^2 = .44$

interpret result

2. Polychotomous -- More than two levels or states

a. Example: Life expectancy for industrial,
nonindustrial and petroleum exporting countries

b. Conditional typical model

$C(LE \mid \text{National Status}) = \hat{y} = b_0 + b_1 X_1 + b_2 X_2$        (4)

where

X_1    X_2    National Status
0      0      Petroleum Exporting
0      1      Nonindustrial
1      0      Industrial

c. OLS estimates

$\hat{y} = 50 + 21.67 X_1 - .57 X_2$,   $r^2$ = .4?        (5)

3. Generally: if categorical variable has r states, r-1        (6)
indicator variables are required

II. Continuous and categorical variables combined. We can
combine continuous and categorical variables to yield
summaries of data. (Cf. analysis of covariance.)

1. Dichotomous indicator

a. Example: Life expectancy by per capita income and
industrial status

b. Conceptual model

$C(LE \mid PCI, \text{status}) = \hat{y} = b_0 + b_1 X_1 + b_2 X_2$
where

$X_1$ is per capita income

$X_2$ = 1 if nation is industrial
        0 otherwise

c. Interpretation of $b_j$

$\hat{y}_i = b_0 + b_1 X_{i1}$ when nation i is not industrial

$\hat{y}_i = (b_0 + b_2) + b_1 X_{i1}$ when nation i is industrial

I.e., $b_2$ is typical shift in LE conditional
on being industrial. Note that the slope of
the line relating life expectancy and per capita
income is the same for both industrial and
nonindustrial nations; only the level is different.
d. OLS estimates:

$\hat{y} = 47.15 + .005 X_1 + 4.72 X_2$

2. Another example: Life expectancy by log(per capita        (7)
income) and industrial status

a. Conceptual model:

$\hat{y} = b_0 + b_1 \log X_1 + b_2 X_2$

where $X_1$ and $X_2$ are defined in 1.

b. OLS estimates:

$\hat{y} = 4.74 + 18.51 \log X_1 + 1.61 X_2$,   $r^2 = .72$        (8)

3. Polychotomous indicator

a. Example: Life expectancy by per capita income,
infant mortality, industrial and petroleum exporting
status

b. Conceptual model:

$C(LE \mid PCI, IM, \text{Status}) = \hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4$

where $X_1$ is per capita income
      $X_2$ is infant mortality
      $X_3$ and $X_4$ are indicator variables for status

X_3    X_4    status
1      0      industrial
0      1      nonindustrial
0      0      petroleum exporting

c. Interpreting $b_j$

$\hat{y} = b_0 + b_1 X_1 + b_2 X_2$ when petroleum exporting
$\hat{y} = (b_0 + b_3) + b_1 X_1 + b_2 X_2$ when industrial
$\hat{y} = (b_0 + b_4) + b_1 X_1 + b_2 X_2$ when nonindustrial

d. OLS estimates

$\hat{y} = 54.7 + .005 X_1 - .059 X_2 - 1.77 X_3 \ldots X_4$
III. More complex situations

Indicator, continuous, transformed, polynomial, and
interaction variables may be combined to construct
effective summary of data

1. Example: Continue Life Expectancy

a. Is an interaction relevant?

i. Estimate: $\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2$

where $X_1$ is per capita income
      $X_2$ is infant mortality
      $X_1 X_2$ is multiplicative interaction term

ii. OLS estimates:

$\hat{y} = 55.95 + .004 X_1 - .0158 X_2 + 2.69 \times 10^{-5} X_1 X_2$

(all are significant)   $R^2 = .68$

b. $\hat{y} = b_0 + b_1 INC + b_2 Mort + b_3 IND + b_4 NonIND + b_5 INC \cdot Mort$

$\hat{y} = 53.89 + .008 X_1 - .09 X_2 + 5.97 X_3 + 2.63 X_4 + 3.31 \times 10^{-5} X_1 X_2$
(discuss change in sign in national status variables)

$R^2 = .68$

c. $\hat{y} = b_0 + b_1 \log INC + b_2 \log Mort + b_3 Ind + b_4 NonInd + b_5 \log Inc \cdot \log Mort$

$\hat{y} = 46.53 + 9.33 \log X_1 \ldots \log X_2 \ldots 4.6 X_3 \ldots X_4 \ldots 3.1 (\log X_1 \cdot \log X_2)$

$R^2 = .80$


2. Another application: Seasonal Shifts

a. Example: Smoothed D.C. General Hospital
% emergency admits by month 1970-1975
(Note -- recall data from lecture on smoothing)
b. Conceptual model:

$C(PEA \mid \text{Season}) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$
where:

$X_1$ = 1 iff summer (June, July, Aug.);
$X_2$ = 1 iff fall (Sept., Oct., Nov.);
$X_3$ = 1 iff winter (Dec., Jan., Feb.);

$b_0$ gives baseline for spring

c. OLS estimates:

$\hat{y} = 8.2 + .4 X_1 + .4 X_2 - .2 X_3$,   $r^2 = .50$

Discuss shape of effects; comparative level        (10)
3. More than 1 indicator variable

a. Example: Income by race and sex

b. Conceptual model:

$C(Inc \mid \text{Race}, \text{Sex}) = b_0 + b_1 X_1 + b_2 X_2$        (11)

where $X_1$ indicates race (two categories)
      $X_2$ indicates sex (two categories)

c. Variable definitions:

                    Race (X_1)
                    B = 0    W = 1
Sex (X_2)  M = 0    MB       MW
           F = 1    FB       FW

d. Interpreting $b_j$:

$b_0$ is level for male black
$b_0 + b_1$ is level for male white
$b_0 + b_2$ is level for female black
$b_0 + b_1 + b_2$ is level for female white

e. Alternative race-sex indicator using interaction        (12)

Assume single variable with four levels:

X_3    X_2    X_1
0      0      0      male black
0      0      1      male white
0      1      0      female black
1      0      0      female white

This structure does not assume additivity of
race and sex effects

Model:

$Inc = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$

Interpreting $b_j$

$b_0$ is level of male black
$b_0 + b_1$ is level of male white
$b_0 + b_2$ is level of female black
$b_0 + b_3$ is level of female white
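The difference between the additive race/sex model and the four-level coding can be seen by computing the implied group levels. The coefficient values below are arbitrary, made-up numbers, and the helper `level` is hypothetical; only the two coding schemes come from the outline.

```python
# (X1, X2) codes for the additive model: X1 = race, X2 = sex.
additive_coding = {
    "male black": (0, 0),
    "male white": (1, 0),
    "female black": (0, 1),
    "female white": (1, 1),
}

# (X1, X2, X3) codes for the single four-level variable.
four_level_coding = {
    "male black": (0, 0, 0),
    "male white": (1, 0, 0),
    "female black": (0, 1, 0),
    "female white": (0, 0, 1),
}


def level(b, x):
    """Fitted level b0 + b1*x1 + ... for one group's indicator codes."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


b_additive = [10.0, 4.0, -3.0]          # hypothetical b0, b1, b2
b_four_level = [10.0, 4.0, -3.0, 6.0]   # hypothetical b0, b1, b2, b3

additive_levels = {g: level(b_additive, x) for g, x in additive_coding.items()}
four_levels = {g: level(b_four_level, x) for g, x in four_level_coding.items()}

# Race gap (white minus black) within each sex under both structures.
gap_men_additive = additive_levels["male white"] - additive_levels["male black"]
gap_women_additive = additive_levels["female white"] - additive_levels["female black"]
gap_men_four = four_levels["male white"] - four_levels["male black"]
gap_women_four = four_levels["female white"] - four_levels["female black"]

print(gap_men_additive, gap_women_additive, gap_men_four, gap_women_four)
```

Under the additive coding the race gap is forced to be the same for men and women; the four-level coding lets each of the four group levels move freely.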
Topic 3: Numeric indicator variables -- known or presumed functional
forms for an X variable -- estimates of slopes

I. Constructed variables

1. Presumed linear

a. Example: Life expectancy by income category -- low,
middle, high

b. Conceptual model:

$C(LE \mid \text{Income}) = b_0 + b_1 X$

where X = 1 if low
          2 if middle
          3 if high

c. OLS estimates:

$\hat{y} = 30.35 + 9.96 X$,   $r^2 = .55$

N = 29
N = 19
N = 71

2. Presumed logarithmic

a. $\hat{y} = b_0 + b_1 \log X$

where X = 1 low
          2 middle
          3 high

b. OLS

$\hat{y} = 39.94 + 40.34 \log_{10} X$

3. Other forms: quadratic, etc.

4. Time trends

a. Linear time -- Example: DC general

b. Model:

$C(PEA \mid \text{Season}, \text{Year}) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4$

where $X_1, X_2, X_3$ are seasonal indicators
$X_4$ is:

Year    X_4
1970    0
71      1
72      2
73      3
74      4

c. OLS estimates

$\hat{y} = 8.2 + .4 X_1 + .4 X_2 - .2 X_3 + .01 X_4$

d. Other functional forms possible: quadratic,
logarithmic, etc.
Topic 4: Combining shifts in intercept and slope: Splines

We can construct variables and data to handle combinations
or special cases. Here function is continuous but derivative
is discontinuous.

1. Two linear time trends: Intersection known, slopes
unknown        (14)

$\hat{y} = b_0 + b_1 X_1 + b_2 X_2$

Data structure (two alternative codings):

Obs    X_1    X_2        X_1    X_2
1      -4     0          0      0
2      -3     0          1      0
3      -2     0          2      0
4      -1     0          3      0
5       0     0          4      0
6       0     1          5      0
7       0     2          5      1
8       0     3          5      2

2. Two linear trends: Intersection unknown,
slopes unknown        (15)

Need third indicator variable to handle intersection

$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$

Data Structure

Obs    X_1    X_2    X_3
1      1      0      0
2      2      0      0
3      3      0      0
4      4      0      0
5      5      0      1
6      5      1      1
7      5      2      1
8      5      3      1
9      5      4      1

$b_0$ = intercept of line #1
$b_1$ = slope of line #1
$b_2$ = slope of line #2
$b_3$ = vertical distance between line #2 and #1 at fifth
observation point


3. Multiple peaks        (16)

Data Structure

Knots: $X_1 = 1$, $X_2 = 2$, $X_3 = 3$, $X_4 = 4$

X    X_1  X_2  X_3  X_4    Z_1  Z_2  Z_3  Z_4
0    1    2    3    4      0    0    0    0
1    1    2    3    4      1    0    0    0
2    1    2    3    4      2    1    0    0
3    1    2    3    4      3    2    1    0
4    1    2    3    4      4    3    2    1

Model

$\hat{y} = b_0 + b_1 Z_1 + b_2 Z_2 + b_3 Z_3 + b_4 Z_4$

where

$Z_1 = X$
$Z_2 = \max(X - X_1, 0)$
$Z_3 = \max(X - X_2, 0)$
$Z_4 = \max(X - X_3, 0)$

then $b_1$ is slope over first segment. Other $b_j$ represent
change in slope from preceding segment.

I.e., slope for $X_1 < X < X_2$ is $b_1 + b_2$
                $X_2 < X < X_3$ is $b_1 + b_2 + b_3$
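A minimal sketch of the spline construction: the regressors are X itself plus one `max(X - knot, 0)` term per knot, and the slope on each segment is the running sum of the coefficients. The knots 1, 2, 3 follow the pattern used in the data structure for this topic; the coefficient values and function names are made up for illustration.

```python
def spline_regressors(x, knots):
    """Z1 = x, then one hinge term max(x - knot, 0) per knot."""
    return [x] + [max(x - k, 0) for k in knots]


def segment_slopes(b):
    """Slope on each successive segment: running sum of the slope coefficients."""
    slopes, total = [], 0.0
    for bj in b:
        total += bj
        slopes.append(total)
    return slopes


knots = [1, 2, 3]
table = {x: spline_regressors(x, knots) for x in range(5)}
for x, z in table.items():
    print(x, z)

# Hypothetical coefficients b1..b4: slope of the first segment,
# then the change in slope at each knot.
b = [2.0, -3.0, 4.0, -1.0]
print(segment_slopes(b))
```

Because each hinge term is zero until its knot is reached, the fitted function stays continuous while its slope changes at every knot.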
Lecture 4-3
Transparency Presentation Guide

Lecture Outline     Transparency
Location            Number          Transparency Description

Beginning           1               Lecture 4-3 Outline

Topic 2
Section A
1.b                 2               Conceptual Model
1.b                 3               Scatterplot of life expectancies
2.b                 4               Model of life expectancies for nations
2.c                 5               Scatterplot of life expectancies

Section B
1.b                 6               Combining indicator and continuous
                                    variables
                    7               Scatterplot of life expectancies vs
                                    per capita income
2.b                 8               Scatterplot of life expectancies vs
                                    log (per capita income)
3.d                 9               Geometrical representation

Section C
2.c                 10              Smoothed seasonal effects
3.b                 11              More than 1 indicator
3.e                 12              Another structure

Topic 3
Section A
4.a                 13              D.C. General Hospital conceptual model

Topic 4
Section A
1.                  14              Linear Spline -- Intersection known
2.                  15              Linear Spline -- Intersection unknown
3.                  16              Multiple peaks
Lecture 4-4. Inference about Least Squares Coefficients

Inference about Least Squares Coefficients: A sampling experiment
involving a univariate regression to study the random nature of        (1)
coefficient estimates

Lecture Content:

1. Discuss the sampling experiment and its purpose of studying
variability

2. Assumptions made in a multivariate linear regression model

Main Topics:

1. The sampling experiment and analysis of coefficient estimates

2. Regression model with "well-behaved" data

3. Variance of least squares coefficient estimates
Topic 1.  The sampling experiment and analysis of coefficient estimates


I.  Basic Issue:  Regression coefficient estimates depend entirely
on the N observations


1.  We calculate the vector of coefficient estimates, b,
using only the (N x p) X matrix and dependent variable y


2.  If we add or delete observations, or use a different set
of N observations, then the estimates will almost certainly
differ


3.  Hence coefficient estimates of the model parameters
depend on the "sample" of N observations

II.  Problem:  How do the estimates of the $b_j$'s vary with the
chosen sample?     (2)

1.  Consider the following example:

a.  $x_i$ = number of surgical procedures for patient i
$y_i$ = length of stay (LOS) for patient i, in days

b.  Data from a hospital on i = 1, 2, ..., 2435
patients

2.  Note that patients have between 0 and 6 surgeries, so that
we have a natural classification into batches

3.  We  will  take  small  "samples"  from  the  large  data  set, 
fit  least  squares  lines,  and  study  how  a  and  b  vary  over 
the  samples 

III.     Solution:    A  sampling  experiment 

1.  Relationship between number of surgeries and length of stay     (3)
is more linear in log(y) scale

a.  Increase in log(LOS) as NSURG increases

b.  Plot shows number of patients with j surgeries,
j = 0, 1, ..., 6.

2.  We study the 7 batches of data, looking at box plots of
log(LOS) with NSURG fixed

a.  Each batch seems symmetric about a modal value

b.  Interesting to note that average log(LOS) is less for
patients with 1 surgery than for patients with 0 surgeries
Parallel schematic plots show equal spreads, general
increase in log(LOS)

Since there are only 39 patients with 4, 5, or 6
surgeries, we combine these three batches.  This schematic
shows linear trend better than earlier plot

Number summaries computed.  Note how $\bar{X} \approx M$, indicating
symmetry, and how $\sigma \approx .75$, reasonably constant

Plot of conditional typical values shows trend

We select a sample of 2142 patients from the 2435 to
use as a sampling base

a.  Note the scatterplot of 2142 patients

b.  Slope of LS line is .157, slightly less than .21
for entire data set

c.  Intercept is similar in both regressions, 2.1

We now draw 100 samples, with varying sample size (about 25),
where the percentage of patients with k surgeries is the
same in each sample as in the entire data set

a.  Stem-and-leaf display of the 100 estimates of a
and b

i.  Note how the intercepts are symmetric about 2.0
or 2.1

ii.  Slope estimates appear very well behaved

b.  Number summaries of the estimates show "well-behavedness"

i.  Intercepts symmetric about 2.1, with $\sigma = .239$

ii.  Slopes symmetric about .12, slightly less than
the .157 for the sample base, although $\sigma = .228$
shows that .12 is not too small

How do the LS estimates compare with resistant line
estimates?

a.  Data have large spread, so that resistant line
estimates may be slightly closer to b = .157

b.  Intercept estimates have larger spread than with
LS, more outliers



c.  Slope estimates also have more spread, although
mode = .15 is more apparent

d.  Number summaries show that b's have mean of .155,     (12)
very close to the "correct" value


IV.  Conclusion:  Sampling shows how estimates of regression
parameters vary

1.  When the observations used in the model do not include
all the observations, i.e. when "sample size" < "population
size", estimates will vary around the "true values"

2.  Hence, there is a degree of randomness inherent in the
estimates

3.  Estimates appear quite well-behaved

4.  LS estimates not quite as accurate as resistant line
estimates, although spread is certainly less
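
The sampling experiment above can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the "population" is synthetic (the hospital data are not reproduced here), its intercept, slope, and spread are chosen merely to resemble the values quoted in the outline, and samples are drawn by simple random sampling rather than with the fixed percentages per batch used in the lecture. The names `ls_line`, `N_POP`, `nsurg`, and `log_los` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the sampling base (an assumption, not the real data).
N_POP = 2000
nsurg = rng.integers(0, 4, size=N_POP)                      # 0..3 surgeries
log_los = 2.1 + 0.16 * nsurg + rng.normal(0.0, 0.75, size=N_POP)

def ls_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# "Correct" values: the fit to the whole sampling base.
a_pop, b_pop = ls_line(nsurg, log_los)

# Draw 100 samples of 25 patients each and refit the line every time.
estimates = []
for _ in range(100):
    idx = rng.choice(N_POP, size=25, replace=False)
    estimates.append(ls_line(nsurg[idx], log_los[idx]))
estimates = np.array(estimates)          # column 0: intercepts, column 1: slopes

a_mean, b_mean = estimates.mean(axis=0)
a_sd, b_sd = estimates.std(axis=0, ddof=1)
print(f"sampling base: a = {a_pop:.3f}, b = {b_pop:.3f}")
print(f"100 samples:   mean a = {a_mean:.3f} (sd {a_sd:.3f}), "
      f"mean b = {b_mean:.3f} (sd {b_sd:.3f})")
```

The batch of 100 intercepts and the batch of 100 slopes can then be examined with stem-and-leaf displays or number summaries, as in the lecture.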


Topic 2.  Regression model with "well-behaved" data

I.  Basic issue:  Definition of well-behaved regression data

1.  Since Data = Fit + Residual, the definition of well-behaved
regression data begins with the residuals

2.  Compute Residual = Data - Fit = $y - \hat{y}$ and examine as
a batch

II.  Problem:  What is a well-behaved batch of residuals?

1.  First of all, the linear model must "fit", so that $R^2$
is large (near 1), and residuals are small

2.  Residuals should be:

a.  Homoscedastic: variance of residuals should remain
constant as X increases (easy to envision in 2
dimensions)

b.  A well-behaved single batch: symmetric about 0,
with 95% between 2 and -2

c.  Devoid of all patterns

3.  Let $\sigma^2$ = Variance of Residuals, estimated by $\hat{\sigma}^2$

4.  Examine the standard error of the residuals ($\hat{\sigma}$) for the     (13)
100 regressions

a.    Ks  .?!  of  the  o's  «  .785 

b.  $\hat{\sigma}$ for entire sample base = .777, so that $\hat{\sigma}$'s are
roughly symmetric about the "correct" value

5.    With  well-behaved  univariate  data,  resistant  line 
estimates  should  equal  LS  estimates 


Topic 3.  Variance of least squares coefficient estimates

I.  Basic Issue:  Theoretical form of variance of least squares
estimates


1.  Now that we know that the batch of residuals has variance
$\sigma^2$, how do the parameter estimates vary?

2.  Let $b_{LS}$ be the vector of least squares coefficient
estimates

3.  $b_{LS}$ is a "random" vector, varying about b, the "true"
regression coefficients

4.  Var($b_{LS}$) depends on X and $\sigma^2$

II.  Problem:  Interpretation of Var($b_{LS}$)

1.  Definition:

$$\mathrm{Var}(b_{LS}) = \sigma^2 (X^T X)^{-1}$$

2.  Diagonal terms of Var($b_{LS}$) are variances of individual
$b_j$

3.  Off diagonal terms, (i ≠ j), are covariances of $b_i$ and $b_j$

4.  Transparency shows output from LS regression of log(LOS) on
NSURG for entire sampling base

a.  Note $\hat{\sigma}^2 (X^T X)^{-1}$

b.  Various other quantities, $R^2$, $\hat{\sigma}$ = standard error, "t",
will be discussed in detail in next lecture
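
A minimal numerical sketch of the definition above, assuming NumPy. The data are synthetic, the function name `ls_with_covariance` is hypothetical, and the divisor N - p used to estimate $\sigma^2$ is an assumption, since the outline's own estimator did not survive in the source.

```python
import numpy as np

def ls_with_covariance(X, y):
    """Return b_LS, the estimate of sigma^2, and sigma^2-hat * (X'X)^{-1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - p)      # assumed estimator of sigma^2
    cov_b = sigma2_hat * xtx_inv              # diagonal: variances of the b_j
    return b, sigma2_hat, cov_b

# Synthetic data loosely resembling log(LOS) on NSURG (not the real data).
rng = np.random.default_rng(3)
x = rng.integers(0, 4, size=200).astype(float)
y = 2.1 + 0.16 * x + rng.normal(0.0, 0.75, size=200)
X = np.column_stack([np.ones_like(x), x])     # column of ones gives the intercept

b, sigma2_hat, cov_b = ls_with_covariance(X, y)
se = np.sqrt(np.diag(cov_b))                  # standard errors of intercept, slope
print("b =", np.round(b, 3))
print("standard errors =", np.round(se, 3))
print("cov(b0, b1) =", round(cov_b[0, 1], 5))
```

For a single X variable the slope's variance reduces to $\hat{\sigma}^2 / \sum (x_i - \bar{x})^2$, which is why spreading the X's over a wider range makes the slope estimate more stable.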
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Standard  Errors  of  the  Residuals 
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Variation  of  Coefficients 
Lecture 4-5.  Model with Least Squares Estimates in Well-Behaved Batches

Model with Least Squares Estimates in Well-Behaved Batches:  Evaluating
the model     (1)

Lecture Content:

1.  Discuss assumptions of linear model and the optimality of the
estimates

2.  Testing how well the model "fits"

Main Topics:

1.  Covariances and variances of variables

2.  Evaluating the model


Module  II 


Topic 1.  Covariances and Variances of Variables

I.  Basic Issue:  How do we measure how "related" a set of
variables are

1.  Develop "pair-wise" measures of relations.  There are $\binom{p}{2}$
measures for a set of p variables

2.  The measure comparing $X_i$ and $X_j$ tells how similar these
2 variables are, independently of all other variables


II.  Problem:  What is the best measure?

1.  Seek a dimensionless quantity: no units

2.  Measure should have a maximum and a minimum value to
aid us in our assessments

III.  Solution:  Covariances, Variances, and Correlations

1.  Covariances, in (units of $X_i$) x (units of $X_j$), tell by how
much $X_{ik}$ and $X_{jk}$ simultaneously vary from their means,
for all k = 1, 2, ..., N

2.  Variances, in units of $X_i^2$, tell by how much the observations
of $X_i$ differ from $\bar{X}_i$, in squared deviations

3.  Correlations, our desired measure of relationships, are
ratios of covariances and variances

IV.  Method


1.  Definition of Covariances and Variances:


2.  The (p x p) matrix of covariances (off-diag.) and variances
(diagonal) is called $\Sigma$ = "Variance-Covariance"
matrix

3.  Correlations

$$r_{ij} = \frac{\mathrm{Cov}(X_i, X_j)}{\sqrt{\mathrm{Var}(X_i)\,\mathrm{Var}(X_j)}}$$

4.  R, the matrix which has ones along the diagonal and $r_{ij}$
as off-diagonal elements, is called the correlation matrix.

5.  Correlation of -1 or 1:
$X_i$ and $X_j$ are linearly related

6.  Correlation of 0:
if $X_i$ and $X_j$ are well-behaved, $X_i$ and
$X_j$ are "independent", i.e., unrelated
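
The quantities in this topic can be computed directly. The sketch below assumes NumPy and uses a small made-up data matrix whose second column is exactly twice the first, so their correlation is 1. The divisor N - 1 in the covariance is an assumption (the outline's own definition is missing from the source), and `cov_and_corr` is a hypothetical name.

```python
import numpy as np

def cov_and_corr(data):
    """Return the variance-covariance matrix and the correlation matrix R.

    `data` has one row per observation and one column per variable.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    dev = data - data.mean(axis=0)            # deviations from the column means
    S = dev.T @ dev / (n - 1)                 # assumed divisor N - 1
    sd = np.sqrt(np.diag(S))                  # square roots of the variances
    R = S / np.outer(sd, sd)                  # covariance / product of spreads
    return S, R

# Made-up data: column 2 is exactly 2 * column 1.
data = np.array([[1.0, 2.0, 5.0],
                 [2.0, 4.0, 3.0],
                 [3.0, 6.0, 4.0],
                 [4.0, 8.0, 1.0]])
S, R = cov_and_corr(data)
print("Variance-covariance matrix:\n", np.round(S, 3))
print("Correlation matrix:\n", np.round(R, 3))
```

Unlike the covariances, the entries of R carry no units and always lie between -1 and 1, which is what makes them usable as a common yardstick.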
Topic 2.  Evaluating the Model

I.  Basic Issue:  When does least squares produce "good" estimates


1.  Remember that LS is just one technique (there are others) for
estimation

2.  LS works well when the data adhere to several assumptions


II.  Assumptions necessary for LS

1.  Model:  $\hat{y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_p X_{pi}$     (2)

2.  Residuals:  $y_i - \hat{y}_i$

3.  Four assumptions:

a.  $y_i$ must be linearly related to the p independent variables

b.  Batch of residuals must be well behaved

c.  Residuals  must  be  independent  of 

d.  Variance of batch of residuals is a constant, $\sigma^2$
These assumptions must be true


4.  If (a)-(d) are true, then by the Gauss-Markov Theorem,
LS estimates are optimal in the sense that the quantity
$\sum (y_i - \hat{y}_i)^2$ is minimized, i.e., estimates have minimum
variance

5.  One implication of these assumptions is that the
coefficients are "well-behaved", i.e., if we were to fit
each model n times, with n different samples, the batch
of $b_j$ values would be well-behaved: unbiased and consistent

6.  Note how well-behaved the standardized $b_j$ coefficients from
the sampling experiment are     (3)


III.     Problem:    How  do  we  determine  whether  these  assumptions  hold? 

1.  We have various measures at our disposal:

a.  multiple correlation coefficient

b.  t-statistics

c.  Thorough examination of residuals

2.  We discuss a and b here, leaving c until next lecture




IV.  Method 

1.  $R^2$ = square of the multiple correlation coefficient     (4)

2.  R = correlation between Y and the linear combination of
X's which maximizes R; the combination is the
regression equation


3.  t-statistic: determines whether a variable belongs in the
model


4.  If $|t_j| > 3$, variable $X_j$ is important; if $t_j$ near 0,
variable $X_j$ can be ignored

5.  $R^2$ and t statistics for the 100 samples     (5) (6)

regression output     (7)
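
To make the two measures concrete, here is a hedged Python sketch that computes $R^2$ and the t statistics for a small synthetic regression. The data are invented (y depends on `x1` but not on `x2`), the function name `r2_and_t` is hypothetical, and the divisor N - p in the variance estimate is an assumption.

```python
import numpy as np

def r2_and_t(X, y):
    """Return LS coefficients, R^2, and t statistics (estimate / standard error)."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                       # fraction of variation "explained"
    sigma2_hat = ss_res / (n - p)                    # assumed estimator of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
    return b, r2, b / se

# Synthetic data: y is related to x1 only; x2 is pure noise.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

b, r2, t = r2_and_t(X, y)
print("R^2 =", round(r2, 3))
for name, tj in zip(["intercept", "x1", "x2"], t):
    verdict = "important" if abs(tj) > 3 else "not clearly needed"
    print(f"t({name}) = {tj:6.2f}  -> {verdict} by the |t| > 3 rule of thumb")
```

With an intercept in the model, the $R^2$ computed this way equals the squared correlation between y and the fitted values, which matches the description of R in item 2.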
Lecture 4-6.  Evaluating the Model


Evaluating the Model:  A thorough examination of the Residuals to
determine how well the model fits

Lecture Content:

1.  Analysis of Residuals

2.  Analysis of Length of Stay of patients in a hospital

Main Topics:

1.  Looking for patterns in the residuals via scatterplots
2.  Applying our inferential procedures to an example

(There  are  no  transparencies.) 
Topic 1.  Analysis of Residuals

I.  Basic Issue:  How can we use the residuals as a batch to find
violations of assumptions

1.  We have measures for overall assessment

a.  $R^2$:  Coefficient of Determination

i.  Function of the Squared Residuals

ii.  Tells what fraction of the total variation is
"explained" by the fitted line

iii.  Comprehensive measure of fit

b.  t statistics:  Ratio of estimate to its standard
error

i.  Indirect function of Residuals: depends on $\sigma^2$ = Var(Residuals)

ii.  Tells whether a given regression coefficient is
non-zero

iii.  If -3 < t < 3, coefficient is essentially 0, i.e.,
the variable does not help to "explain" y

iv.  Note:  to increase t statistics, one usually
decreases S.E. of b

$$\mathrm{S.E.}(b_i) = \left[\sigma^2 (X^T X)^{-1}_{ii}\right]^{1/2}, \quad (i,i) \text{ element}$$

A.  We can either decrease $\sigma^2$

B.  Or increase $X^T X$: spread X's out over
wider range

2.  These measures are "gross" in that an assessment of
violations of assumptions is not allowed

3.  We need to examine residuals further to determine
whether they:

a.  Are well behaved

b.  Are homoscedastic

c.  Are independent
II.  Problem:  How should we examine them

1.  We can only conclude that either

a.  The LS assumptions appear to be violated in some
specified way

b.  The LS assumptions do not appear to be violated

2.  Note that b. does not mean the assumptions are correct.
It means that given the data that we have seen, we have
no reason to conclude that they are violated

3.  We examine the residuals graphically

4.  Scatterplots are easy to make, and quite revealing

III.    Solution:    Principal  ways  of  plotting  residuals 

1.  Stem-and-leaf

2.  In time sequence (if relevant)

3.  Against the fitted (conditional typical) values $\hat{y}$

4.  Against  the  independent  variables 

5.  Any  other  sensible  ways 

IV.  Methods

1.  Stem-and-Leaf

a.  Stem-and-leaf the residuals $r_i = y_i - \hat{y}_i$ as a
single batch

b.  Display should resemble a well-behaved batch

c.  If not, the well-behavedness assumption is violated,
and one should determine exactly how by

i.  Looking at $\bar{X}$ and $\sigma^2$

ii.  Examining outliers carefully

2.  Time sequence plots

a.  If data are time series, i.e. gathered over time, we
should plot the residuals (y) vs corresponding value
on the time scale (X)

b.  There are 5 characteristic shapes to these plots:

Shape 1.  Random: desired pattern


Shape 2.  Wedge


Shape 3.  Linear


Shape 4.  Quadratic


Shape 5.  Trigonometric: sine patterns


c.  These shapes imply the following

i.  Shape 2.  Heteroscedasticity

Use weighted LS (next lecture)

ii.  Shape 3.  Linear time term needed in model

iii.  Shape 4.  Quadratic time term and linear time
term needed

iv.  Shape 5.  Tough luck: seasonalities
Try indicator variables
Residuals are not independent

3.  Plots against the fitted values

a.  Shape 2.  Heteroscedasticity

Transform y

b.  Shape 3.  Error in analysis

Model  Incorrect— did  you  leave  out  b^l 


c.  Shape  4.    Model  Inaccurate 

SeSs**^**^"''"*^  interaction) 

d.  Shape 5.  ?  Residuals not independent

4.  Plots against the independent variables
1  per 

a.  Shape 2.  Heteroscedasticity
Transform y

b.  Shape 3.  Probable error in calculations

Linear effect not removed

c.  Shape 4.  Need extra higher order terms

d.  Shape 5.  ?  Residuals not independent

5.  Other residual plots

J^^f^""f^«,^°°e  from  different  proceas^^^  (1-10: 
bSl^t  separate 

b.  If  considering  a  new  Independent  variable,  plot  It 
against  the  residuals 


Topic 2.  Applying our inferential procedures to an example

I.  "Length of hospital stay", Lave & Leinhardt

Examine  Tables  2,  3,  4,  5 
Lecture 4-7.  Problems with Least Squares Estimation

Problems with Least Squares Estimation:  Effect of invalidated
assumptions on the coefficient estimates

Lecture Content:

1.  Small number of degrees of freedom

2.  Robust and LAR regression

3.  Ridge regression

Main Topics:

1.  Overfitting: more variables than observations

2.  Non-well-behaved batches of residuals

3.  Collinear independent variables

(There  are  no  transparencies) 

Note:  This  lecture  covers  advanced  topics  and  should  be  con- 
sidered optional. 
Topic 1.  Overfitting: more variables than observations

I.  Basic Issue:  Effects on LS estimates when N is small
relative to p

1.  Define "degrees of freedom" as N-p
Desire N-p ≥ 6

2.  Suppose N-p < 6

3.  When N = p we have a "perfect fit":  regression line
completely describes the relationship between X and y,
residuals are 0, $R^2 = 1$.

4.  If N < p, then we are in trouble.  We will not be able
to estimate all the coefficients, only N functions of
them

5.  Statistical methodology for handling this situation is
underdeveloped, and quite unsatisfactory

II.  Solutions

1.  We can always delete variables until N is larger than p

2.  Or, we can forget about fitting multiple regression
models, and examine each of the p variables as a single
batch

3.  We then try to combine these single, separate analyses to
form some impression

4.  Or we can combine variables, for instance by forming
interactions, so that p is reduced
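
The "perfect fit" at N = p and the trouble at N < p are easy to demonstrate numerically. The sketch below is illustrative only: it uses pure random noise for both X and y (so there is no real relationship to find), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5

# Case N = p: 5 observations, 5 coefficients (intercept + 4 noise variables).
X = np.column_stack([np.ones(N), rng.normal(size=(N, N - 1))])
y = rng.normal(size=N)                        # y is unrelated to X by construction
b = np.linalg.solve(X, y)                     # square system: exact solution
resid = y - X @ b
r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
print("N = p: max |residual| =", np.abs(resid).max(), " R^2 =", r2)

# Case N < p: 5 observations, 8 coefficients.
X_wide = np.column_stack([np.ones(N), rng.normal(size=(N, N + 2))])
p_wide = X_wide.shape[1]
rank_wide = np.linalg.matrix_rank(X_wide)
print(f"N < p: p = {p_wide}, but rank of X is only {rank_wide}")
```

Even though y is noise, the N = p fit reports $R^2 = 1$, so a perfect fit with zero degrees of freedom says nothing about the quality of the model. In the N < p case the rank of X is at most N, which is the numerical counterpart of being able to estimate only N functions of the coefficients.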
Topic 2.  Non-well-behaved batches of residuals

I.  Basic issue:  How do departures from "well-behavedness"
affect LS estimates

1.  Most common problem in least squares regression is
outlying values; skewness is another frequently observed
departure from well-behavedness

2.  Heteroscedastic residuals best treated via transformation
or Weighted Least Squares (see below)

3.  Lack of independence in the errors also best treated with
Weighted Least Squares

4.  Outliers?  We also use a "fancy" version of weighted
least squares

II.  Problem:  How do we use weighted least squares to improve LS
estimates?

1.  We assign a weight to each observation that tells how
important that observation is

2.  Ordinary Least Squares assigns weights of unity to every
observation; hence large outliers receive the same weight
as observations which have small residuals

3.  We would like to assign smaller weights to these larger
outliers, so that they become less important

4.  We generate a matrix W, that is diagonal, with weights
lying between 0 and 1.  W is (N x N), the (i,i)th diagonal
element is the weight that we assign the ith residual

a.  If ith residual is small, $w_{ii}$ near 1: full weight

b.  If ith residual is quite large in absolute value,
$w_{ii}$ near 0: little weight

5.  How do we determine these weights?

a.  If we know what they should be, we have no problem:
merely form W, perform the WLS calculations given
below, and everything will be fine

b.  If we have no idea, we use "Robust Regression"
techniques, or Least Absolute Residual Regression
III.  Solutions

1.  Robust Regression

a.  We find the matrix W in an iterative procedure
by minimizing

$$\sum_{i=1}^{N} \rho\left(\frac{y_i - x_i b}{S}\right)$$

i.  $x_i$ = ith row of X

ii.  S is an estimate of $\sigma$

iii.  $\rho$ is our weighting function.
We describe our functions in terms of $\psi$, the
derivative of $\rho$.

iv.  Our weights $w_{ii}$ are $\psi(Z)/Z$.  These are the
values we place on the diagonal in W

b.  Examples of $\psi(\cdot)$

i.  Least Squares

$$\psi(r_i) = r_i, \qquad \frac{\psi(r_i)}{r_i} = 1, \qquad r_i = \text{ith residual from some "initial" fit}$$

note how large residuals are not weighted
downward
ii.  Sine

$$\psi(Z) = \begin{cases} c \sin(Z/c) & \text{if } |Z| \le c\pi \\ 0 & \text{otherwise} \end{cases}$$

Note how large residuals get weighted
downward to 0.

iii.  Other functions include Huber

c.  We find $w_{ii}$ and make the W matrix and compute

$$\hat{b} = (X^T W X)^{-1} X^T W y$$

d.  We then find the residuals from this fit, weight
them again, and recompute $\hat{b}$.  Stop when coefficients
converge


2.  Least Absolute Residual

$$\psi(Z) = \mathrm{sign}(Z)$$

Minimizes $\sum_{i=1}^{N} |Z_i|$ (absolute residuals: like finding medians)
Also an iterative procedure

3.  Weighted Least Squares

heteroscedastic  tendencies 
b.  We then estimate

$$b_{WLS} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$$

This is not an iterative procedure, since we assume
$\Sigma$ known
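
The iterative reweighting described under Robust Regression can be sketched in Python as follows. Several choices here are assumptions rather than anything specified in the outline: the scale estimate S is taken to be the median absolute deviation of the residuals divided by 0.6745, the tuning constants `c=1.345` (Huber) and `c=1.339` (sine) are conventional defaults from the robust-regression literature, and the data are synthetic with one planted outlier. The function names are hypothetical.

```python
import numpy as np

def huber_weights(z, c=1.345):
    """w = psi(z)/z for Huber's psi: weight 1 if |z| <= c, else c/|z|."""
    az = np.abs(z)
    return np.where(az <= c, 1.0, c / np.maximum(az, 1e-12))

def sine_weights(z, c=1.339):
    """w = psi(z)/z for psi(z) = c*sin(z/c) when |z| <= c*pi, and 0 otherwise."""
    u = z / c
    # np.sinc(t) is sin(pi*t)/(pi*t), so np.sinc(u/pi) equals sin(u)/u.
    return np.where(np.abs(u) <= np.pi, np.sinc(u / np.pi), 0.0)

def irls(X, y, weight_fn=huber_weights, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares: b = (X'WX)^{-1} X'Wy until b settles."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)          # "initial" fit: ordinary LS
    for _ in range(max_iter):
        r = y - X @ b
        s = np.median(np.abs(r - np.median(r))) / 0.6745   # assumed scale estimate S
        if s == 0:
            break
        w = weight_fn(r / s)                           # diagonal of W
        xtw = X.T * w                                  # X'W without forming W
        b_new = np.linalg.solve(xtw @ X, xtw @ y)
        if np.max(np.abs(b_new - b)) < tol:
            b = b_new
            break
        b = b_new
    return b

# Synthetic line y = 1 + 2x with small noise and one gross outlier at the end.
rng = np.random.default_rng(4)
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=20)
y[-1] += 30.0
X = np.column_stack([np.ones(20), x])

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
b_rob = irls(X, y)
print("LS slope:    ", round(b_ols[1], 3))
print("robust slope:", round(b_rob[1], 3))
```

Passing `weight_fn=sine_weights` gives the sine version instead; because it assigns weight exactly 0 to very large residuals, it can discard points entirely, so it is usually started from a reasonable initial fit.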
Topic 3.  Collinear Independent Variables

I.  Basic Issue:  What effects do nonindependent "independent
variables" have on LS estimates

1.  We should always examine R, the correlation matrix of the
X's

2.  If any off-diagonal element of R is near -1 or 1, then
we have problems

3.  We can remedy this situation, multicollinearity,
only by deleting one of the 2 offending variables

4.  If the situation is more complicated (many correlations
> .5 or < -.5) and we cannot delete the necessary variables,
then we can use "Ridge Regression"

5.  When there are many related X vectors, $X^T X$ will be
difficult to invert


II.  Solution:  Ridge Regression

1.  Ridge regression operates directly on $X^T X$, making it more
"stable" and hence invertible


2.  Model:  $\hat{b} = (X^T X + kI)^{-1} X^T y$


3.  k lies between 0 and 1 and is added to the diagonal of
$X^T X$ to increase its stability
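
A small Python sketch of the ridge estimate, under stated assumptions: the X columns are centered and scaled so that $X^T X$ is the correlation matrix R (which is what makes a k between 0 and 1 a sensible size), y is centered so no intercept is needed, and the two nearly collinear variables are synthetic. The names `standardize` and `ridge` are hypothetical.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit length, so Z'Z is the correlation matrix."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).sum(axis=0))

def ridge(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y; k = 0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Synthetic data: x2 is almost a copy of x1, so the two are highly correlated.
rng = np.random.default_rng(5)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
y = 1.0 + x1 + x2 + rng.normal(size=n)

Z = standardize(np.column_stack([x1, x2]))
yc = y - y.mean()
R = Z.T @ Z                                   # correlation matrix of the X's

k = 0.1
b_ls = ridge(Z, yc, 0.0)
b_ridge = ridge(Z, yc, k)
cond_before = np.linalg.cond(R)
cond_after = np.linalg.cond(R + k * np.eye(2))

print("correlation of x1, x2:", round(R[0, 1], 4))
print("condition number before / after:", round(cond_before, 1), "/", round(cond_after, 1))
print("LS coefficients:   ", np.round(b_ls, 3))
print("ridge coefficients:", np.round(b_ridge, 3))
```

Adding k to the diagonal lowers the condition number of the matrix being inverted and shrinks the coefficient vector; the price is some bias, so k is normally kept as small as stability allows.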


Homework, Unit 4

1.  Let y = MFU = Median income of families and unrelated individuals
$X_1$ = POP = Population of each tract

$X_2$ = NWHITE-PCT = Percent of population of each tract that is
nonwhite

$X_3$ = YEARS-ED-C = Average number of years of education per
individual

$X_4$ = PCT-GT65Y = Percent of population greater than 65 years
of age

(a)  Plot each $X_j$ against y.  Comment on these 4 plots.  Are the
patterns linear?  Do any of the independent variables
require transformation?  If so, transform and plot again.

(b)  Fit a multiple regression line to these data.  Does the
model make sense?  Comment with respect to:

(i)  Units of analysis of the X's

(ii)  "Causality" and the underlying theory of the situation
being modeled.

(c)  Calculate and examine the residuals as a single batch of data.
What do they tell you about your fitted line?

(d)  Calculate and examine $R^2$.  Comment.

(e)  Bsing  the  linear  model  with  any  additional  information  _^ 
^  ^    lithired  in  (b),  (c),  and  (dj,  comment 

lations  of  the  Sodel  relating  median  fmily  income  to  the 
ieveral  X  variables i 

(f)  Could we have fit another line using a subset of these 4
independent variables and still have obtained a good fit?  If
so, what subset?

(g)  If you could have any set of X variables (not necessarily those
included in the 9 in DAP) to predict median family income,
which variables would you choose and why?

2.  Stored in CMU-DAP is an archive named COBBDOUGLAS with 4 variables,
one observation per state:

i)  YMJ^^ 

ii)  CAPITAL

iii)  LABOR

iv)  ESTABL

These data are economic data used by many economists to "predict" the
total value added (in dollars) to a subset of the economy via the
Cobb Douglas production function model:


ifir 


where  V = Value added (millions of $)

L = Labor (millions of man-hours)
K = Capital Services Flow (millions of $)
N = Number of Establishments
$A, \alpha, \beta$ = Parameters

(a)  Reexpress this model in the usual multiple regression form.  Plot
each (transformed) independent variable against the (transformed)
dependent variable.  Comment on these plots.

(b)  Calculate the LS regression (fit a multiple regression to the
transformed data).  Comment on this model with respect to:

(i)  causality

(ii)  possible dependence among the independent variables.

(c)  Calculate and plot the residuals.  What do they tell you about your
fitted line?

(d)  Calculate $R^2$.  What does this tell you about your fitted line?

(e)  What are the elasticities of value added with respect to Labor/
establishment, and Capital Services/establishment, and what do
they mean?

(f)  Based on this model, should the new (incoming) administration
concentrate more on decreasing unemployment (i.e. increasing L),
or pumping money into the economy (i.e. increasing K)?  (Or,
what combination of both together?)
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3.    In  '32!j-DAPit  ther-i  Is  ah  archive  named  GA!5,g5»±th  the  foiiowing 
variables, one observation per state:


*-  -i 

lioo^'ct'o^oH    tJni^oc — Riii?PJ5  -    And   tT^ictfi   in  1973  • 

"VEHICLES" 

State gasoline tax per gallon, in cents, in 1973.

"GASTAX"

Motor fuel consumption, in thousands of gallons, in 1973.

"CONSUMPTION"

Population, 1970 census.

"POP"

Population density, 1970 census, land area only.

"POPDENS"

vij 

Per capita income, in 1973.

"PCINCOME" 

The data show motor fuel consumption in 1973 along with 5 other,
possibly related, variables.


(a)  Plot each variable against fuel consumption.  Comment on these
plots.  Do all of these variables appear related to fuel
consumption?

(b)  Calculate the LS regression (i.e. fit a multiple regression
to the data).  Does the model make sense?  Comment with
respect to

(i)  causality and the underlying theory of the situation
being modeled

(ii)  possible dependence among any of the "independent"
variables

(c)  Calculate and plot the residuals.  What do they tell you
about your fitted line?

(d)  Calculate $R^2$.  What does this tell you about your fitted line?

(e)  From your discussion in (b), decide upon the two "independent"
variables which you feel yield the most "reasonable" model
from a causality/dependence point of view.  Repeat parts (b),
(c), and (d) for this "reduced" model.  Compare these results
to those found for the "complete" model and discuss.

(f)  Does it bother you to have some data from 1970 and some from
1973?  How do you think this affects the "validity" (such as it
is) of the models ("complete" and "reduced")?

(g)  If you could have any other "independent" variable(s) you desire
to use to "predict" fuel consumption, which one(s) would you add?
Which of the 5 provided would you retain?  Write your model (but
do not attempt to calculate any parameters).


4.  Many people dream of "beating the stock market", being able to
predict accurately its behavior (and hence being able to buy low and
sell high).

Coen, Gomme, and Kendall proposed the following model to predict the
London Stock market:

$$Y_t = \beta_0 + \beta_1 t + \beta_2 X_{1,t-6} + \beta_3 X_{2,t-7}$$

where t = time (in quarters, from 1952/3 to 1967/4, t = 1 for 1952/3,
etc.)

$Y_t$ = Financial Times ordinary share index at time t

$X_{1,t}$ = United Kingdom car production at time t

$X_{2,t}$ = Financial Times commodity index at time t

Note that two of the "independent" variables are lagged, that is, the
values at some previous time are used to predict the next value of Y.
(The data are stored, however, as $Y_t$, $X_{1,t-6}$, and $X_{2,t-7}$.)

The  data  are  stored  in  DAP  under  the  archive  STCiaSCt,  with  variable 
names 

SHAREIND 
CARPR0D-LAG6 

COMMIND-LAG7

(a)  Plot each "independent" variable against Y.  Comment on these
plots.

(b)  Calculate the LS regression (i.e., fit a multiple regression
line to the data).  Does the model make sense?  (Comment with
respect to:

(i)  Causality

(ii)  Possible dependence among any of the "independent
variables")

(c)  Calculate and plot the residuals.  What do they tell you about
your fitted line?

(d)  Calculate $R^2$.  What does this tell you about your fitted line?

(e)  Based upon your answers to (b), (c), and (d) above, would you
be willing to use this model to "play the market"?  Why or
why not?
Homework Unit 4
Solutions


(a)  MFU vs. POP is plotted in Figure A.  An approximately linear
pattern is discernible (note the line).

MFU vs. NWHITE-PCT is plotted in Figure B.  Transformation
by log or negative reciprocals is clearly required.

fnr^Bf  KHHfl'-PCT  desonitrates  less  linearity. 

gUtiofih^.r^  ^W-rth"iiner '°  " 

The  pattern  is  roughly  linear  (note  the  line). 

MFU vs. PCT-GT65Y is plotted in Figure E.  Again, transformation
by log or negative reciprocals seems necessary.

MFU  v..  WO  «>t^CT-CT65Y  is  plotted 

linear  pattern  is  ««««'^"V<"|g  J^5|^?^3  uftle  ±m- 
^^mSrt^^^^^^^  ^e  Will  Stic.  With 
the  fcOG. 

i.apr6«l  isa  J»8  irdiffHSrio  lnt*rp«t.  oily 

Hote  that  to  *oth  pases  we  co«  tocreaie  incoae), 

whereas  Ae  5«««";^f 'y°Lr"lthoa^  different  in  each 
'  over  65  ««  f Slfwe  trS        these  two  var- 

^^^^ 

A unit analysis is not really appropriate to this problem.

^ich  ^r^S2icP  Sr^r^"^4edi.f  incite 

ILllIrS'S'^I^Ul  vl"rtl«  slasS^  in  differing  units 
1#  UopU  plr  SHt'  niswhlte/lOO  people  is  tract,  years  of 
2uStli:/Sd"  people  over  65/100  people  in  tract). 
Dependence among the "independent" variables should be
investigated further.    For example:

- % of people older than 65 may be correlated with years
  of education, due to

  - current adult education trends

  - the fact that most of the elderly are women, and higher
    education of women was (allegedly) more prominent in the
    1930's

  - older people have had more years in which to be educated!

- % nonwhite may be highly (negatively) correlated with years
  of education

- if one believes nonwhite mortality rates to be higher than
  those for whites, % older than 65 may be highly negatively
  correlated with % nonwhite.

Causality is another question.    It does make a certain amount
of sense to suggest that changes in the age, race, and education
levels of the population "cause" changes in the median
income level.    We certainly do not expect population to be
causal.
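One way to follow up on the suspected dependence is to compute pairwise correlations among the "independent" variables. The sketch below is illustrative only: the tract values and the names `pct_nonwhite`, `pct_over_65`, and `years_educ` are invented for the example and are not taken from the homework data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical tract-level values, invented purely for illustration.
predictors = {
    "pct_nonwhite": [5.0, 12.0, 30.0, 45.0, 60.0],
    "pct_over_65": [14.0, 12.0, 9.0, 8.0, 6.0],
    "years_educ": [12.5, 12.1, 11.0, 10.2, 9.8],
}

# Correlation for every distinct pair of predictors.
names = list(predictors)
corr = {
    (a, b): pearson(predictors[a], predictors[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

for (a, b), r in corr.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Correlations near +1 or -1 between two predictors would be a warning that their separate coefficients are hard to interpret.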


(c)    A stem-and-leaf display and a boxplot of the residuals are
shown in Figure H.    The residuals appear well-behaved:
symmetric about median 0, small, and with few outliers.
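The same check can be made numerically. The following is a minimal sketch, assuming the usual hinge-based rule that flags values more than 1.5 H-spreads beyond the hinges; the residual values are hypothetical and are not the ones displayed in Figure H.

```python
from statistics import median


def hinges(values):
    """Lower and upper hinges: medians of the lower and upper halves
    of the sorted batch (the middle value is shared when n is odd)."""
    s = sorted(values)
    half = (len(s) + 1) // 2
    return median(s[:half]), median(s[-half:])


def residual_summary(residuals):
    """Median, H-spread, and values lying outside the 1.5 H-spread fences."""
    lower, upper = hinges(residuals)
    h_spread = upper - lower
    step = 1.5 * h_spread
    outside = [r for r in residuals if r < lower - step or r > upper + step]
    return {"median": median(residuals), "h_spread": h_spread, "outside": outside}


# Hypothetical residuals, invented for illustration.
resid = [-0.30, -0.21, -0.12, -0.05, 0.00, 0.04, 0.10, 0.19, 0.33, 1.40]
summary = residual_summary(resid)
print(summary)
```

A median close to zero, a small H-spread, and a short list of outside values correspond to the "well-behaved" description above.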


(d)    R^2 for the transformed data is .7863 (Figure G).    The model
fits rather well, from an R^2 point of view.


(e)    Median income appears to be highly positively influenced
by years of education, and negatively influenced
by both % nonwhite and % over 65.    This suggests that the
CAUSES for lower nonwhite and elderly incomes should be
investigated in more detail, and policies to benefit those
groups (minority education, training programs, more efficient
use of elderly resources, etc.) should be examined.  Before
any such policy is adopted, however, all of its implications
should be thoroughly studied.


(f)    Since population appears not to greatly affect the model,
this X variable might be dropped.     (If the resulting model
is very different, however, this move must be reconsidered.)

(g)    Some possibilities might be

-  property tax rate (if this varies by tract)

-  % of tract zoned for apartments (this however could be
double-edged in the case of luxury apartments)

-  % of population in professional fields (i.e. lawyers, doctors,
etc.)

-  population density (usually inversely related to median
income)

Of  course,  there  are  many,  many,  other  possibilities. 
Cobb-Douglas Problem

(a)    We have (V/N) = A (L/N)^α (K/N)^β.
Taking natural logs:    ln(V/N) = ln[A (L/N)^α (K/N)^β]

= ln A + ln[(L/N)^α] + ln[(K/N)^β]

= ln A + α ln(L/N) + β ln(K/N)

V* = ln(V/N) = b0 + b1 L* + b2 K*,

the usual multiple regression form.

(We prefer natural log for theoretical reasons.  The
analysis could also have been done using base 10 logs.)

Plots of V* vs. L* and V* vs. K* are shown in Figures A and B.
Both plots are remarkably linear for real data.     (But note
that since we are dealing with a model based on a specific and
well-defined theoretical model we would not perform any
transformations on V*, L*, or K*.)
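The log transformation in part (a) can be checked on synthetic data. The sketch below is not the DAP analysis: the parameter values, the labor and capital inputs, and the helper names `solve` and `least_squares` are all assumptions made for illustration. It generates value added from a known power-law form, regresses the logs, and recovers the exponents.

```python
from math import exp, log


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(rows, y):
    """Least squares coefficients via the normal equations.
    Each row must already contain a leading 1.0 for the constant."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical parameters and per-establishment inputs (illustration only).
A, ALPHA, BETA = 10.0, 0.9, 0.3
labor = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0]
capital = [2.0, 1.5, 4.0, 3.0, 9.0, 6.0]
value = [A * l ** ALPHA * k ** BETA for l, k in zip(labor, capital)]

# Regress log(value) on log(labor) and log(capital).
rows = [[1.0, log(l), log(k)] for l, k in zip(labor, capital)]
b0, b1, b2 = least_squares(rows, [log(v) for v in value])
print(f"A = {exp(b0):.3f}, alpha = {b1:.3f}, beta = {b2:.3f}")
```

Because the synthetic data contain no noise, the fitted b1 and b2 reproduce the exponents and exp(b0) reproduces the multiplicative constant.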

(b)  The MREG output is shown in Figure C.    We are "predicting"
or "modeling" value added (per establishment) in dollars
by manhours (labor) per establishment and capital flow
(dollars) per establishment.    Note that all variables are
measured in units PER ESTABLISHMENT.    The underlying economic
theory considers labor and capital services flow as causing
value added, and not dependent on each other (although one
certainly tends to increase both in order to increase value
added).

(c)  A stem-and-leaf plot of the residuals is shown in Figure D.
Except for the one HI value, they are fairly well behaved.
The residuals are then plotted against each independent
variable; against K* in Figure E and L* in Figure F.  Both
plots appear "random".

(d)  From the MREG output, R^2 = .9597, hence our model "explains"
the data very well.

(e)    The elasticity of value added with respect to labor/establishment
is α = .9276.    The elasticity of value added with respect
to capital/establishment is β = .2788.

Recall from Tufte page 114 that the elasticity of Y with
respect to X measures the percentage change in Y with respect
to the percentage change in X.    Hence, if labor/establishment
is doubled (i.e. 100% increase), value added increases by roughly 93%.
Similarly, if capital flow is doubled, value added increases
by roughly 28%.
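Multiplying the elasticity by the percentage change is a first-order approximation that is most accurate for small changes. The sketch below, which assumes the fitted power-law form Y = c X^elasticity holds exactly (the function name is ours), compares that approximation with the exact change implied by the same elasticities.

```python
def pct_change_in_y(elasticity, pct_change_in_x):
    """Exact percentage change in Y when Y = c * X**elasticity
    and X changes by pct_change_in_x percent."""
    ratio = 1.0 + pct_change_in_x / 100.0
    return (ratio ** elasticity - 1.0) * 100.0


ALPHA = 0.9276  # labor elasticity quoted in the solution
BETA = 0.2788   # capital elasticity quoted in the solution

for name, e in [("labor", ALPHA), ("capital", BETA)]:
    approx = e * 100.0                    # elasticity times a 100% change
    exact = pct_change_in_y(e, 100.0)     # exact effect of doubling X
    small = pct_change_in_y(e, 1.0)       # effect of a 1% change
    print(f"{name}: 1% change -> {small:.2f}%, "
          f"doubling -> {exact:.1f}% (linear approximation {approx:.1f}%)")
```

For a 1% change the two calculations agree closely; for a doubling the exact figures come out somewhat below the approximate ones, which does not alter the ranking of labor over capital.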

While labor/establishment appears to provide the larger
proportional increase in value added (since α > β), the key
to this question is the unknown costs.    Given a fixed amount
of resources, what is the unit cost (in common units) for a
given proportional change in labor/establishment compared to
the unit cost for the same proportional change in capital/
establishment?

For example, if each % change in capital/estab costs $X, and
each % change in labor/est. costs $Y = $2X, then for approximately
the same amount of resources we could (approximately)
double value added/establish by either quadrupling capital
services/estab or doubling labor/estab., or by a combination
of increased labor and capital flow.

Hence the optimal policy is not obvious, depending heavily
upon these unknown costs.    Research is required to determine
these costs.
FIGURE C

MREG LVAL VS LCAP LLAB SAVERES RES3

RESPONSE          MEAN      STD. DEV.
LVAL              0.7717    0.8973

CARRIER           CONSTANT  LCAP      LLAB
COEFFICIENT       2.2932    0.2783    0.9276
S.E. COEF.                  0.0807    0.0983
MEAN                        -2.0820   -1.0x46
STD. DEV.                   0.8730    0.7154

MULTIPLE R SQUARED  0.9597

ANALYSIS OF VARIANCE TABLE

                  SS        DF    MS        RMS
FIT               18.6281    2    9.3140    3.0519
RESIDUAL           0.7819   22    0.0355    0.1885
TOTAL             19.4100   24

                  F         F PROB.
FIT               262.0535  1.0000
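The entries of the analysis of variance table in Figure C are related by simple arithmetic, which gives a quick consistency check on the printout. The sketch below uses only the sums of squares and degrees of freedom from the table; the variable names are ours.

```python
from math import sqrt

# Sums of squares and degrees of freedom as printed in Figure C.
ss_fit, df_fit = 18.6281, 2
ss_res, df_res = 0.7819, 22
ss_total = ss_fit + ss_res

ms_fit = ss_fit / df_fit        # mean square for the fit
ms_res = ss_res / df_res        # mean square for the residuals
rms_res = sqrt(ms_res)          # root mean square of the residuals
f_stat = ms_fit / ms_res        # F statistic for the fit
r_squared = ss_fit / ss_total   # multiple R squared

print(f"MS fit = {ms_fit:.4f}, MS residual = {ms_res:.4f}")
print(f"RMS residual = {rms_res:.4f}")
print(f"F = {f_stat:.2f}, R squared = {r_squared:.4f}")
```

Small differences from the printed values in the last digit reflect rounding of the printed sums of squares.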
In this problem the first thing we must do is consider the units
and magnitudes of the data we are working with.  CONSUMPTION,
VEHICLES, and POP are each in units of 1 with magnitudes of 6 or
7.    PCINCOME is in units of 1 with a magnitude of 3.    GASTAX and
POPDENS are in units of 1 with magnitudes of 0 and 1, respectively.
The very large differences in these magnitudes will be
reflected in large differences in the magnitudes of the variances
of the variables.    We will also be faced with both very large and
very small coefficients in the regression equation.    To avoid
these scaling problems, we need to equalize the magnitudes by
increasing the units of those variables with magnitudes greater
than 0 or 1.     (There may also be situations in which units
should be decreased to achieve equality of magnitudes.)  The
appropriate DAP commands are:

LET    CONSUMP = CONSUMPTION/1000000
LET    VEHIC = VEHICLES/1000000
LET    P = POP/1000000
LET    PCI = PCINCOME/1000

(NOTE:    The  variable  names  on  the  left  hand  sides  of  the  equations 
can  be  any  name  of  your  choice.) 

The resulting variables CONSUMP, VEHIC, and P are in units of 1
million with magnitudes of 0 and 1.    PCI is in units of 1 thousand
with magnitude 0.
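The same rescaling can be sketched outside DAP. In the example below the raw values are hypothetical figures for a single state, and `magnitude` is our own helper that reads "magnitude" as the power of ten in scientific notation, which matches the usage above.

```python
from math import floor, log10


def magnitude(x):
    """Order of magnitude: the power of ten when x is written in
    scientific notation."""
    return floor(log10(abs(x)))


# Hypothetical raw values for one state (illustration only).
raw = {
    "CONSUMPTION": 2_450_000.0,
    "VEHICLES": 1_800_000.0,
    "POP": 3_400_000.0,
    "PCINCOME": 4_300.0,
}

# New name -> (source variable, divisor), mirroring the LET commands.
rescale = {
    "CONSUMP": ("CONSUMPTION", 1_000_000),
    "VEHIC": ("VEHICLES", 1_000_000),
    "P": ("POP", 1_000_000),
    "PCI": ("PCINCOME", 1_000),
}

scaled = {new: raw[old] / div for new, (old, div) in rescale.items()}

for new, (old, _) in rescale.items():
    print(f"{old}: magnitude {magnitude(raw[old])} -> "
          f"{new}: magnitude {magnitude(scaled[new])}")
```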

(a)    A plot of CONSUMP vs. VEHIC is shown in Figure A.    The data
look remarkably linear.

A plot of CONSUMP vs. GASTAX is shown in Figure B.    The pattern
(if any) appears linear.

A plot of CONSUMP vs. P is shown in Figure C.    The data appear
reasonably linear.

A plot of CONSUMP vs. POPDENS is shown in Figure D.    While a
transformation appears to be in order, none seems to improve
the pattern.    LOG, SQRT, and Negative Reciprocal were tried
with no noticeable improvement.

A plot of CONSUMP vs. PCI is shown in Figure E.    The plot
suggests some linearity and no transformation seems applicable.
The cone shaped spread is of some concern, but a Log transformation
of CONSUMP does not improve the plot much.    A transformation
of the dependent variable would also necessitate
transformations of the independent variables which exhibit
linearity in the raw form.    The slight improvement in spread
in the one plot is not worth the sacrifice of simplicity of the
model.

Examining these plots, VEHIC and P appear most, and GASTAX
and POPDENS least, related to fuel consumption.

(b)  The MREG output for CONSUMP vs. each of the five X variables
is shown in Figure F.

While the model "makes sense" in that we are attempting to
model or predict fuel consumption by five measurable
qualitatively related X variables, it is a good example of one in
which we have too many unnecessary "independent" variables,
resulting in much complexity with little or no offsetting
improvement in accuracy (as we shall see in (e) below).

First, we might expect VEHIC to be related to P.  Certainly
we expect more vehicles when there are more people.    In
cities, however, we might expect considerably fewer vehicles
(many city dwellers do not own cars, although the number of
cabs and buses would be larger).    In suburban or rural areas,
we might expect a larger number of vehicles for that
population.    The number of vehicles per person is not constant
across geographic area, although some relationship between
these two variables obviously exists.

Economic theory tells us that GASTAX should not have a great
effect on CONSUMP.    Our plots in part (a) reinforce this.  We
expect people to continue to utilize the available modes of
transportation in order to commute to and from work, regardless
of the marginal differences in the price of gasoline due
to taxes (as opposed to dramatic price increases caused by
other factors).

PCI should probably have a "threshold" effect on CONSUMP.
When median income is below some threshold value, people cannot
afford a private car and must rely on public transportation.
Above this threshold, CONSUMP would then rise to some
limiting value, after which increases in income do not have
any effect.

POPDENS serves as a sort of index of urbanization.  As discussed
above, we expect fewer autos per capita (but more buses,
cabs, and possibly truck traffic) in highly urban regions,
while we expect the reverse in rural areas.

(c)  A stem-and-leaf and boxplot of the residuals is shown in Figure
G.    The plots should make us suspicious of our model, since the
residuals are not particularly well behaved.    Note the large
number of HI values, and the overall location of the batch
(i.e. nonzero median).    A plot of the residuals vs. each
independent variable would also be helpful, perhaps indicating a
hidden relationship (e.g. consumption vs. a quadratic in
POPDENS).    The residuals are all small, however, relative to
the size of the original CONSUMP units.


(d)  R^2 for this model is .9875 (Figure F).  This indicates that
the model "explains" the data very well.     (But the answer
in (c) above should caution us against evaluating the adequacy
of a model solely on its R^2 value.)

(e)  From our discussion in (b) and the plots in (a), we know
that one of the two must be VEHIC.    Since we expect P to be
functionally related to VEHIC, PCI is the best choice for a
second variable, as is indicated by the t statistics.

The MREG output for this model is shown in Figure H.  Note
that the coefficients for these two variables are almost
exactly the same as those in the complete model (Figure F).

A stem-and-leaf and a boxplot of the residuals are shown
in Figure I.    There is some improvement (fewer HI values)
over the full model, but the plots are similarly suspicious.

R^2 for this (reduced) model is .9864, almost the same as that
for the "complete" model.

Now, note that in the reduced model:

(i)    The coefficients of the remaining variables are virtually
identical to their respective coefficients in the
"complete" model.

(ii)    The residual patterns are similar to those of the
"complete" model.

(iii)    The R^2 value is essentially the same as that of the
"complete" model.

These are three very strong indications that the three
independent variables dropped from the "complete" model were in
fact unnecessary.

(f)  There is really no reason even to consider using the
"complete" model, since the reduced model does just as "well"
(e above).    Note that in the reduced model all data are from
1973.

For policy, decision making, or detailed study purposes,
however, all data should be from the same time frame.

(g)  One  possibility  is  to  use  AUTOS,  BUSES,  and  TRUCKS  instead  of 
VEHICLES. 

Price per gallon at the pump (PPGAL) might be a good
predictor, since we expect radically higher prices to
discourage leisure and "unnecessary" driving.

Miles of highway (MILES-ROAD) might be used as another
indication of mobility in each state.

Indicator variables for minimum driving ages (IND-AGE)
and for mandatory state auto inspections (IND-INSPECT)
might also be used to detect possible influences of driving
habits or automobile efficiency on gas consumption.

Other possibilities no doubt exist.    Clearly however,
consumption depends primarily on the number of vehicles and
the degree of use.

The  model  might  then  be 


CONSUMPTION = b0 + b1 AUTOS + b2 BUSES + b3 TRUCKS + b4 PPGAL
+ b5 MILES-ROAD + b6 AGE + b7 INSPECT


A surprising outcome of the regressions in this problem is the
negative coefficient for PCI.    This is not what we would
expect in theory nor what we would have guessed from the plot
of CONSUMP vs PCI (Figure E).    It seems likely that the
large variation in the CONSUMP coordinates for the large PCI
values permits a great deal of latitude in fitting a least
squares line.    Given this unexpected and difficult to explain
outcome, we might try a univariate regression of CONSUMP vs
VEHIC.    The MREG result is shown in Figure J.    Note that R^2
is virtually unchanged.    The residual structure is even
somewhat improved (more symmetric, smaller values) as is
shown by a stem-and-leaf display and boxplot (Figure K).
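A univariate least squares fit and its R^2 are easy to compute directly. The sketch below is a generic illustration, not a reproduction of Figure J: the `vehic` and `consump` values are invented, and `simple_regression` is our own helper.

```python
def simple_regression(x, y):
    """Least squares line y = intercept + slope * x.
    Returns (intercept, slope, r_squared, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1.0 - ss_res / ss_tot, residuals


# Hypothetical state values in millions (illustration only).
vehic = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0]
consump = [0.4, 0.9, 1.7, 3.1, 4.2, 6.9]

b0, b1, r2, resid = simple_regression(vehic, consump)
print(f"CONSUMP = {b0:.3f} + {b1:.3f} VEHIC, R^2 = {r2:.4f}")
```

The residuals returned by the function can then be examined with a stem-and-leaf display or boxplot before any further variables are considered.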

Anytime we are able to achieve a very high R^2 with a good
residual structure using univariate regression, we should
be very reluctant to add other variables for small
"improvements" in the explanatory power of the model.  This
problem is a clear case of where the univariate model is the
"best".
4.    Stockmarket Problem

(a)    SHAREIND vs. CARPROD-LAG6 is plotted in Figure A and looks
somewhat curved.

SHAREIND vs. COMMIND-LAG7 is plotted in Figure B.    A somewhat
linear pattern (with negative slope) is discernible.

SHAREIND vs. TIME (DM) is plotted in Figure C.    A strong
linear (although jagged) pattern is clear.

perform  any  transformations  on  Y  or  the  X  s. 
(b)    The  HREG  output  is  shown  in  Figure  D. 

be  a  reasonable  ecdndtnic  theory. 

Shduia  bi  tivcstlsatsd  »!•  economic  theory. 

J  tiif  Slnt  of  the  residbals  Is  shotm  in  Figure  t. 
i^°-'S;f}lillUi?J  baha«d  (although  «  should^ 

S-'f)!  Iloirof  SUd«J1l  "fc-«G?  (Fig»*e  0,  and 
C0MMIND-iAG7  (Figure  H)  are  more    random  . 

(d)    The MREG output shows R^2 = .8295.    This suggests that the
model provides a "good" explanation of the data.

(e)    Your  answer  to  this  part, will  depend  in  P«|  '^^^^ 
.f  risk  aversion,    ^or  most  people,  the^mode^^^^^^ 
sufficiently  accurate, ^nor^does^ it  ^Jf  Ip^gn^^^hose  who  are 
to.be  used  ^-^1^°-;^!^^^^^^ 

Sri'^ilivelhis'Sdln  go'i^  an  atlempt  to  '^eat  the 
system'** 
Quiz, Unit 4


You have 30 minutes to complete the quiz.    Answer all questions, but
Background 

You are working for the federal Department of Housing and Urban
Development (HUD).    The department has just completed a project in which
$1.3 million was allocated to developers for the purpose of constructing
new graduate level public management curricula in three "need" areas.
Solicited proposals were scored by experts.    The approximately 260
proposals received were divided equally in the three need areas.  The
maximum obtainable score was 81 and the lowest was 0.  To determine how
"fair" the granting procedure was HUD has asked you to perform an
evaluation.    You gathered data on various characteristics of the
proposals, constructed a model and used least squares techniques to estimate
parameters.    The results appear below.


Dependent Variable:    Total Score for Proposal

Independent Variable              Coefficient Estimate    t-statistic

Length (in pages)                        .74                 5.55
Budget Request (in dollars)              .05                  .21
Need area 1                            -7.11                 3.10
Need area 2                             1.80                  .76
Constant                               47.87

.21

Explanation of Variables

Length - number of pages; Min = 6, Max = 58.

Budget Request - how much money was requested to do the task.

Need areas 1, 2, 3 - areas of training in which HUD believes new curricula
are required.    Introduced as 0/1 indicator variables
with the following structure:

Need Area:             1    2    3

Variable values:       1    0    0

                       0    1    0
1.    Why do the regression results imply that quality was not the only
factor influencing the selection process?    20 points


2.    The following is a plot of page size against total score.

[Plot of page size against total score]

a.    Why does this plot suggest that the regression results are
suspect?  10 points

b.    Give two alternative forms for the pages variable that would help
summarize its behavior better.  10 points


3.    HUD has been under attack for favoring schools in the Northeast over
other areas of the country.    How would you test this proposition
using these data, assuming you know which institutions submitted the
proposals?  20 points


4.    a.    What is the "effect" of a proposal's being in need area 3?
20 points


b.    What is the estimated standard error of the length coefficient?
20 points




Quiz, Unit 4
Solutions


Both page length and curriculum need area one had significant
coefficients, indicating that these had an effect on the selection
process.

There is curvature evident in the variable for length of proposal;
the relationship is not strictly linear.    Linear regression assumes
a linear relationship between the dependent and each independent
variable.

A parabolic fit of the form p + p^2 or using splines would have been
appropriate alternative forms.

The  propositibh  could  be  analyzed  through  the  use  of  dummy 
variables  indicating  the  region  of  the  institutions  submitting  the 
proposals. 

The "effect" of a proposal being classed in curriculum need area 3
was the constant, i.e., 47.87.    The coefficients for need areas 1
and 2 make adjustments to the constant.
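The role of the constant as the need area 3 baseline can be made concrete with a short sketch. The coefficient values are those reported in the quiz table; the function names are ours, and the length and budget terms are left out so that only the need-area adjustment is shown.

```python
CONSTANT = 47.87                         # constant from the quiz table
NEED_AREA_COEF = {1: -7.11, 2: 1.80}     # coefficients for need areas 1 and 2


def indicators(area):
    """0/1 indicator pair (need area 1, need area 2); area 3 is the baseline."""
    if area not in (1, 2, 3):
        raise ValueError("need area must be 1, 2, or 3")
    return (1 if area == 1 else 0, 1 if area == 2 else 0)


def area_intercept(area):
    """Constant term for a proposal in the given need area, before the
    length and budget terms are added."""
    d1, d2 = indicators(area)
    return CONSTANT + NEED_AREA_COEF[1] * d1 + NEED_AREA_COEF[2] * d2


for area in (1, 2, 3):
    print(f"need area {area}: indicators {indicators(area)}, "
          f"intercept {area_intercept(area):.2f}")
```

Both indicators are zero for need area 3, so its intercept is the constant itself.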


t-statistic = coefficient / standard error

standard error = coefficient / t-statistic

= .74 / 5.55

= .133
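The same rearrangement works for any coefficient whose t-statistic is reported. A minimal sketch follows; the function name is ours.

```python
def standard_error(coefficient, t_statistic):
    """Standard error recovered from a coefficient and its t-statistic,
    using t = coefficient / standard error."""
    if t_statistic == 0:
        raise ValueError("t-statistic must be nonzero")
    return abs(coefficient / t_statistic)


# Length coefficient and t-statistic from the quiz table.
se_length = standard_error(0.74, 5.55)
print(f"standard error of the length coefficient = {se_length:.3f}")
```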


Final Examination
First Term


Name


All answers should be written on the test.    Total point score is 100;
point values for each question are indicated.    Do not spend all your time
on questions with low point values!    You have 2.6 hours to complete this
examination.    Your answers should be concise and to the point.


For the following problems assume that you are a public manager in
a government agency.


(1)    When you were first interviewed for this job, your supervisor
could not understand why a public manager needed to take QMPM.  You
answered her by making four points.     Circle them:         4 pt.

a.     Knowledge of statistics improves rational thinking.

b.    A good knowledge of statistics makes you a better liar.

c.  Public managers are essentially decision makers.

d.  You need a graduate education in statistics to read the New
York Times.

e.  Most public policy analyses are unplanned and post hoc.

f.  Public managers need to be able to perform, interpret and
present quantitative analyses.

g.  Knowledge of skills in data analysis assures a GS-11 rating.

h.  More often than not, data relevant to public policy decisions
are quantitative and "messy".


(2)    The first task you encounter on the job involves an analysis of
physician offices by census tract in urban areas.    The policy
issue concerns the equity of access by urban residents to
physicians.

a.    To begin the analysis you ask to see a batch of physician
office data for a single city.  Your supervisor asks you what a
batch is.    You reply....  2 pt.


b. 


fnr  the  "average"  Sutnber  of  physiciahs  per  tract 
m  asgrSSi  lSe%i...  over  .he  ».» 


the  inode^  - 
and  you  replied...*  JLEE 


c.    Her next request concerns variation in the data.  "Calculate
the variance", she says.  But you hesitate and calculate the
H-spread.  How are they related, and when would you not hesitate
to calculate the variance?  pt.


d.    Impressed with your knowledge she asks you to draw a histogram.
You construct a stem-and-leaf.  What is the
essential difference between these two kinds of displays?


The stem-and-leaf shows that physician offices are frequent
in only a few census tracts, and that, by and large, most
tracts have only a few offices and many have none.  Sketch
an outline of what the stem-and-leaf of this batch probably
looks like.    2 pt.


f.    When you see the stem-and-leaf of the physician data, you
immediately suggest re-expressing.    What might re-expression
achieve?  2 pt.


When you talk about this problem, you casually mention the
simple ladder of powers.  What is the simple ladder of powers,
and why is it relevant to this problem?    4 pt.


She then asks how you decided which transformation to use, and
you tell her about a general technique.    What is it?    2 pt.


Since physician offices are counted data, you had in mind a
specific transformation before you even tried this method for
finding a transformation.    What transformation was it that
you had in mind?    2 pt.


A colleague suggests looking at the data differently.
Instead of counts he suggests dividing the number of offices
in each tract by the total number of offices in the city.
This would yield a variable that was the proportion of a
city's physician offices in each tract.    What transformation
would you be likely to try on the data when they are in
this form?      2 pt.


Elated with your thorough analysis of the single batch of
physician data, another colleague suggested that you look
at data from several cities simultaneously and compare them.
Knowing that cities vary quite a bit in size you thought that
re-expression would be needed.    Why might you need to
transform the data for this comparison, and how would you find the
appropriate transformation?  4 pt.
Pursuing the analysis of office location, you decide to see if you
could construct a model relating the number of offices in a tract
to other public policy relevant features of the census tract.

a.  In constructing the model you are told to include median
income and socioeconomic status as independent variables,
but you suspect that these variables are highly correlated.
This means that    3 pt.

(a)    There is a linear relationship between them.

(b)    Their covariance is positive.

(c)    They are related in a curvilinear fashion.

(d)    The covariance equals the product of the standard
deviations.

b.  You expect to use least squares techniques to estimate your
model.    Consequently, you suspect that two problems may
arise because of these correlated variables.    2 pt.


(a)  R^2 will be near unity.

(b)  Their coefficient estimates will be unreliable.

(c)  The residuals will be random.

(d)  X'X will not invert.

(e)  The computer may have problems inverting X'X.

(f)  The coefficient of determination will be indeterminate.

c.    To get around these problems you suggest two possible
solutions:  4 pt.

(a)  Use only one of the pair.

(b)  Add  one  to  each  variable  and  take  logs. 

(c)  Use  weighted  least  squares, 

(d)  Use  ridge  regression. 

(e)  Use  the  arc-sin  square  root  transformation. 
When you tell your supervisor about your intentions to use
least squares to estimate the model she asks what this means.
Your reply is that it uses one specific minimization criterion
which is:    1 pt.

(a)  r(Y^ 

(b)  i|y^ 

(c)  S(Y^ 

(d)  E(y2 

(e)  None of the above.

You continue your explanation by saying that, "If the assumptions
underlying least squares hold, then this procedure
yields optimal estimates of the coefficients".    What are the
assumptions?     3 pt.


"But in what sense are least squares regression lines optimal?"
she asks.    You reply...  3 pt.


"OK", your colleague pipes up, "so they are optimal when
the assumptions hold.    But suppose that for our data the
assumptions don't hold.    What does that imply?"    In your
response you touch on the consequences of failure in each
assumption.    What do you say?  4 pt.




(4)    At this point you have established your credentials as a competent
policy researcher.    Word passes quickly through offices
in the agency, and when you turn around you meet an economist
from down the hall.    He explains that he, too, is working on a
physician study but he has been investigating the growth of
the physician supply in different countries.    He says he is
going to fit a straight line to the data from five countries and
contrast slopes.    You are aghast and demand to see plots of
supply versus time.    He shows you the five plots below and you
see that you are right again.    Instead of simply running the
data through you counsel fitting different models to summarize
each batch of paired values.    Which alternatives below do you
suggest for each of the plots?      10 pt.

i)  logged dependent variable

ii)  logged independent variable

iii)  logged dependent and independent variable

iv)  continuous spline

v)  discontinuous spline

vi)  straight line linear model

vii)  linear model after removal of outliers

viii)  linear model with dummy variable(s)


[Plots (a) through (e): physicians (vertical axis) vs. time (horizontal axis)]


(5)    Very concerned about this affront to his knowledge in the presence
of his peers, the economist explains that it doesn't make any
difference.    Retaining your cool, you politely disagree and for
each of the five cases you describe how a simple linear fit and
the fits you've suggested would differ.    Your explanations focus
on what the residuals from the simple fits would look like.  10 pt.


Impressed by your knowledge, your colleague admits to having
performed regressions automatically before.    He shows you the
following residual plots and asks what he should do next for
each.    10 pt.


[Residual plots: residual (vertical axis) vs. fitted value (horizontal axis)]
(7)    The same colleague tells you of a study of income and years of
physician training in different countries.    He shows you the
following plots.    Since each is a univariate situation you suggest
resistant lines.    But he doesn't understand and you sketch (on the
plots) how least squares and resistant lines would look in each
situation.  10 pt.


[Plots: income (vertical axis) vs. training (horizontal axis), one panel per situation]
Once you got rid of your now better informed colleague you are
once again able to concentrate on your own problem, physician
location in cities.    Someone draws your attention to an article
on physician location by the famous scientific research team of
Kaplan and Leinhardt.    You know this paper and recall its presentation
and findings.    In particular, you recall that they found
little or no effect for an area's income and racial characteristics.
In the paper they present the results of their regression runs and
imply that this should convince.    Were you convinced?    If so, why?
If not, what else would you want to see before you were convinced?

6 pt.


You also recall that they included variables that had policy
relevance from two points of view. What are these points of view
and what are examples of variables in each category? 4 pt.


Final Examination Solutions
First  Term 


(2)a. Correct answers are (c), (e), [illegible], and (h), although Huff
emphasizes that [illegible] may be true.

A batch of data is a set of similar numbers, obtained in some
consistent fashion.

Each average, the mean, [...] of a batch. The mean is [...].
However, unlike the median, [...] very sensitive to these departures
from "well-behavedness", since we must sum all the [...].
[...] the median, the middle observation of the batch, or perhaps
the mode, the data value with the largest frequency, will
be more typical of the batch than the mean.

The H-spread, the difference between the hinges, and the variance
are both measures of spread. In a well-behaved batch,

3/4 × H-spread ≈ √variance.

The H-spread is a more resistant measure of spread than the
variance [...] acceptable measure.
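The 3/4 rule above can be checked numerically. The following is a minimal sketch, assuming that "well-behaved" means roughly Gaussian and that the hinges can be treated as the quartiles of the distribution; neither assumption is spelled out in the solution itself.

```python
from statistics import NormalDist

# Assumption: a "well-behaved" batch is modeled by a standard Gaussian
# (mean 0, standard deviation 1), and the hinges are its quartiles.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

lower_hinge = standard_normal.inv_cdf(0.25)
upper_hinge = standard_normal.inv_cdf(0.75)
h_spread = upper_hinge - lower_hinge

# Three quarters of the H-spread should be close to the standard
# deviation, i.e. to the square root of the variance (here 1.0).
approx_sd = 0.75 * h_spread

print(f"H-spread       = {h_spread:.4f}")
print(f"3/4 x H-spread = {approx_sd:.4f}")
```

Under these assumptions the approximation is off by roughly one percent, which is why the H-spread can stand in for the standard deviation when outliers make the variance unreliable.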

"StS.^-  oS^afS'lght.  Of  tL  bSs  .re  lndlc.«6.  «.d  the 
leaf  digits  are  discarded. 

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Onit  -  1% 

o" 


or  values  with  this  shap 


A reexpression of these data will promote symmetry within the
batch and perhaps decrease the number of outlying values.


The ladder of powers is a graphical representation of reexpressions
of the form X^R for values of R such as -1, -1/2,
0 (log), 1/2, 1, 2. We can use the ladder to determine the
effects that various reexpressions will have on the original
batch, e.g., a batch skewed to the right has a long right tail,
and a transformation down the ladder, X^(1/2) or log X, will
pull the long right tail in toward the rest of the data,
making the skewness less noticeable.


Examine the mean, median, and midhinge = 1/2 (UH + LH), and perhaps
the midextreme = 1/2 (UE + LE). In a symmetric batch, these
quantities are equal. By reexpressing the 5-number summary of
the batch, you can compute the 4 quantities and examine their
equality/inequality and hence find the best transformation.
Hint:

If med < midsp < midext., go down the ladder.
If med > midsp > midext., go up the ladder.
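The hint can be tried on a small batch. The sketch below is illustrative only: the data values are invented, the helper names are hypothetical, the hinges are taken as medians of each half of the sorted batch (including the median when the count is odd), and the asymmetry ratio is just one convenient way to compare rungs of the ladder.

```python
import math
from statistics import median

def five_number_summary(batch):
    """Return (low extreme, lower hinge, median, upper hinge, high extreme)."""
    values = sorted(batch)
    n = len(values)
    half = (n + 1) // 2  # each half includes the median when n is odd
    return (values[0], median(values[:half]), median(values),
            median(values[n - half:]), values[-1])

def mid_summaries(batch):
    """Return (median, midhinge, midextreme)."""
    low, lower_hinge, med, upper_hinge, high = five_number_summary(batch)
    return med, (lower_hinge + upper_hinge) / 2, (low + high) / 2

def asymmetry(batch):
    """(midextreme - median) / H-spread; near 0 for a symmetric batch."""
    low, lower_hinge, med, upper_hinge, high = five_number_summary(batch)
    return ((low + high) / 2 - med) / (upper_hinge - lower_hinge)

# Hypothetical right-skewed batch (not data from the course).
batch = [1, 2, 3, 4, 6, 9, 15, 30, 100]

ladder = {
    "raw": lambda x: x,
    "square root": math.sqrt,
    "log": math.log,
}

for name, reexpress in ladder.items():
    reexpressed = [reexpress(x) for x in batch]
    med, midhinge, midextreme = mid_summaries(reexpressed)
    print(f"{name:12s} med={med:7.3f} midhinge={midhinge:7.3f} "
          f"midext={midextreme:7.3f} asymmetry={asymmetry(reexpressed):6.3f}")
```

For the raw batch the median is below the midhinge, which is below the midextreme, so the hint says to move down the ladder; the asymmetry ratio shrinks at each step down.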


Square root, or perhaps logarithms, moving down the ladder.


Square root of the arcsine of proportion physician offices in
each tract.

One suggestion is merely to examine the proportion of
physician offices in each tract in each city, since these data
are independent of the size of the city. If you desire to work
with the actual numbers and discover that the spread of the batches
increases or decreases as the location of each batch
changes, then a log median vs log midspread plot will reveal a
reexpression that stabilizes the spreads.
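A log median versus log midspread plot can be summarized by the slope of a line through its points. The sketch below uses invented batches and hypothetical helper names. It also relies on the usual exploratory data analysis rule of thumb, which is not stated in the solution itself, that a slope of b suggests reexpressing by the power 1 - b (with power 0 read as the log).

```python
import math
from statistics import median

def median_and_midspread(batch):
    """Median and H-spread, with hinges as medians of each half."""
    values = sorted(batch)
    n = len(values)
    half = (n + 1) // 2
    lower_hinge = median(values[:half])
    upper_hinge = median(values[n - half:])
    return median(values), upper_hinge - lower_hinge

def least_squares_slope(xs, ys):
    """Slope of the least squares line of ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical batches whose spread grows in proportion to their level.
batches = [[scale * v for v in range(1, 10)] for scale in (1, 10, 100)]

log_medians = []
log_midspreads = []
for b in batches:
    med, midspread = median_and_midspread(b)
    log_medians.append(math.log10(med))
    log_midspreads.append(math.log10(midspread))

slope = least_squares_slope(log_medians, log_midspreads)
suggested_power = 1 - slope  # rule of thumb; power 0 means take logs

print(f"slope = {slope:.3f}, suggested power = {suggested_power:.3f}")
```

Here the spread is exactly proportional to the median, so the slope is 1 and the suggested reexpression is the logarithm.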

(3)a. Correct response is (a).

b. Correct responses are (b) and (e).

c. Correct responses are (a) and (d).

d. Correct response is (c).

e. Assumptions are

(1) The model is correct, i.e., y is a linear function of the
independent variables.

(2) Residuals are independent.

(3) Residuals are homoscedastic.

(4) Residuals are well-behaved.

f. Out of all linear unbiased estimates, the least squares line is
the one with minimum variance. Least squares lines are optimal
only if the 4 assumptions hold.

g. (1) If the model is not correct the regression coefficients do
not estimate the true population coefficient values. Moreover,
none of the computed regression statistics are
believable.

(2) Nonindependence of the residuals indicates that the observations
are related. The R^2 statistic will not measure the
goodness of fit, and the standard errors of the regression
coefficients will not be accurately computed.

(3) Heteroscedastic errors have the same effect on the computed
regression statistics as nonindependence of errors.

(4) Non-well-behavedness of residuals invalidates various
distributional assumptions, e.g. batches of regression
coefficients will not be well behaved.

(4)a. Dummy variable.

b. Logged dependent variable.

c. Straight line linear model after removal of outlier.

d. Logged dependent or logged independent variable, or both if
necessary.

e. Continuous spline.


(5) For each of the suggested fits, the residuals, plotted against
time, should appear as a random swarm of points centered on the
time axis, and as a well-behaved batch when displayed in a
stem-and-leaf.

The residuals of the fits using the original data will exhibit
various patterns, as shown below:


(a) 3 clusters of residuals

(b) quadratic pattern

(d) same as b.

trigonometric pattern [panel label not recoverable]

[Residual sketches not reproduced.]


(6)a. Do nothing.

b. Transform: X up the ladder, or Y down.

c. Transform Y to remove heteroscedasticity, or use weighted least
squares.

d. Fit a quadratic to the data:

y = a + bX + cX^2

e. Computations probably incorrect. Make sure that X has been
included in the model, where X = variable of x-axis.

[Sketches of fitted lines not reproduced.]

Both lines are undefined since there is not enough variation
in the X variable. However, different computer programs for
least squares regression will fit different lines.

Kaplan and Leinhardt studied a rather controversial issue and
their results, as indicated in the exam question, contradict
most expectations. As the exam question also pointed out, however,
they reported only the regression results. That is, they
gave coefficient values and t-statistics and expected the reader
to be convinced by their results. In any analysis, and especially
one involving a highly controversial issue, we would want more
information on the data analysis procedure. The issue of belief
here is one of an evaluation of the effectiveness of the analysis.
Thus, we would want to see such things as a correlation matrix to
see if collinearity occurred, stem-and-leaf displays and plots of
residuals to detect heteroscedasticity, nonlinearity and
non-well-behavedness, discussion of possible interactions, and plots
of independent variables against the dependent variable to explore
for needed transformations. A general discussion of the exploratory
analysis should always appear, but actual exploratory results are
especially important in this case because policy may be influenced
by the analytic results, and if the results are due to poor analysis
the policy may do more harm than good.

Policy variables may be of two types. On the one hand, they may be
"policy manipulable", i.e., we may be able to construct public policies
which actually change these variables. On the other hand,
they may be "policy directive", i.e., while not actually amenable
to manipulation by policy they may guide policy development by
focusing attention on special situations or target groups. Examples
of the former variables are zoning regulations, hospital beds,
physician offices, education, income. Examples of the latter are
race, age, population. It should also be pointed out that if the
policy issue is one involving the location of individuals even
policy directive variables may become policy manipulable variables.
For example, we can change the age distribution of a census tract
by erecting an apartment house solely for senior citizens.
Covariances and Independence in the Bivariate
Multiple Regression Model


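Consider the situation of two "independent" x variables:

$$y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + e_i, \qquad i = 1, \dots, N.$$

The least squares solution for b is

$$b = (X'X)^{-1} X'y.$$

To make this result easier to comprehend, center the variables by
subtracting their means:

$$X_{i1} - \bar{X}_1, \qquad X_{i2} - \bar{X}_2, \qquad y_i - \bar{y}.$$

This shift of location forces the line to pass through the origin and,
therefore, to have a y intercept of 0. Thus, $b_0 = 0$ in this "new" model
and we do not need a column of ones in the X matrix.

X becomes

$$X = \begin{pmatrix}
X_{11} - \bar{X}_1 & X_{12} - \bar{X}_2 \\
X_{21} - \bar{X}_1 & X_{22} - \bar{X}_2 \\
\vdots & \vdots \\
X_{N1} - \bar{X}_1 & X_{N2} - \bar{X}_2
\end{pmatrix}$$

and X' is

$$X' = \begin{pmatrix}
X_{11} - \bar{X}_1 & X_{21} - \bar{X}_1 & \cdots & X_{N1} - \bar{X}_1 \\
X_{12} - \bar{X}_2 & X_{22} - \bar{X}_2 & \cdots & X_{N2} - \bar{X}_2
\end{pmatrix}.$$

X'X is the product of these two matrices:

$$X'X = \begin{pmatrix}
\sum (X_{i1} - \bar{X}_1)^2 & \sum (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) \\
\sum (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) & \sum (X_{i2} - \bar{X}_2)^2
\end{pmatrix}.$$

Recall that the variance of X is defined as

$$\mathrm{Var}\,X = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2$$

and the covariance of $X_k$ and $X_p$ is

$$\mathrm{Cov}(X_k, X_p) = \frac{1}{N} \sum_{i=1}^{N} (X_{ik} - \bar{X}_k)(X_{ip} - \bar{X}_p).$$

(When the covariance of $X_1$ and $X_2$ is 0 then the variables are not linearly
related; when it is > 0 or < 0 they are linearly related.) Therefore,
(X'X) can be written as

$$X'X = N \begin{pmatrix}
\mathrm{Var}\,X_1 & \mathrm{Cov}(X_1, X_2) \\
\mathrm{Cov}(X_1, X_2) & \mathrm{Var}\,X_2
\end{pmatrix}.$$

The inverse of this symmetric matrix is simply

$$(X'X)^{-1} = \frac{1}{N\left(\mathrm{Var}\,X_1\,\mathrm{Var}\,X_2 - (\mathrm{Cov}(X_1, X_2))^2\right)}
\begin{pmatrix}
\mathrm{Var}\,X_2 & -\mathrm{Cov}(X_1, X_2) \\
-\mathrm{Cov}(X_1, X_2) & \mathrm{Var}\,X_1
\end{pmatrix}.$$

To evaluate X'y we simply multiply X' by the column of centered values
$(y_1 - \bar{y}, \dots, y_N - \bar{y})'$ to get

$$X'y = \begin{pmatrix}
\sum (X_{i1} - \bar{X}_1)(y_i - \bar{y}) \\
\sum (X_{i2} - \bar{X}_2)(y_i - \bar{y})
\end{pmatrix}.$$

But this is

$$N \begin{pmatrix} \mathrm{Cov}(Y, X_1) \\ \mathrm{Cov}(Y, X_2) \end{pmatrix}.$$

Now, $(X'X)^{-1} X'y$ can be written

$$\frac{1}{\mathrm{Var}\,X_1\,\mathrm{Var}\,X_2 - (\mathrm{Cov}(X_1, X_2))^2}
\begin{pmatrix}
\mathrm{Var}\,X_2 & -\mathrm{Cov}(X_1, X_2) \\
-\mathrm{Cov}(X_1, X_2) & \mathrm{Var}\,X_1
\end{pmatrix}
\begin{pmatrix} \mathrm{Cov}(Y, X_1) \\ \mathrm{Cov}(Y, X_2) \end{pmatrix}
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}.$$

Multiplying out and setting up equations for $b_1$ and $b_2$ gives

$$b_1 = \frac{\mathrm{Var}\,X_2\,\mathrm{Cov}(Y, X_1) - \mathrm{Cov}(X_1, X_2)\,\mathrm{Cov}(Y, X_2)}
{\mathrm{Var}\,X_1\,\mathrm{Var}\,X_2 - (\mathrm{Cov}(X_1, X_2))^2}$$

and

$$b_2 = \frac{\mathrm{Var}\,X_1\,\mathrm{Cov}(Y, X_2) - \mathrm{Cov}(X_1, X_2)\,\mathrm{Cov}(Y, X_1)}
{\mathrm{Var}\,X_1\,\mathrm{Var}\,X_2 - (\mathrm{Cov}(X_1, X_2))^2}.$$

By examining these equations we can see two important aspects of least squares
estimation. First, if the two variables $X_1$ and $X_2$ are identical then
$\mathrm{Cov}(X_1, X_2)$ will be $\mathrm{Cov}(X_1, X_1)$, and this equals $\mathrm{Var}\,X_1$. Consequently,
the difference $\mathrm{Var}\,X_1\,\mathrm{Var}\,X_2 - (\mathrm{Cov}(X_1, X_2))^2$ will reduce to
$(\mathrm{Var}\,X_1)^2 - (\mathrm{Var}\,X_1)^2 = 0$
and there will be no solution (or an infinite number of solutions) to the
equations for the b coefficients. Obviously, when the denominator is close
to 0 (when $X_1$ and $X_2$ are very similar) computers will begin to have
problems giving precise answers.

On the other hand, when the two X variables are not linearly related at
all then $\mathrm{Cov}(X_1, X_2) = 0$. Consequently, the equation for $b_1$ (and similarly
for $b_2$) reduces to:

$$b_1 = \frac{\mathrm{Var}\,X_2\,\mathrm{Cov}(Y, X_1)}{\mathrm{Var}\,X_1\,\mathrm{Var}\,X_2}$$

or

$$b_1 = \frac{\mathrm{Cov}(Y, X_1)}{\mathrm{Var}\,X_1}.$$

Writing this out yields

$$b_1 = \frac{\sum (X_{i1} - \bar{X}_1)(Y_i - \bar{Y})}{\sum (X_{i1} - \bar{X}_1)^2}$$

which is the equation we obtained for the univariate situation. The
equation for $b_2$ would be

$$b_2 = \frac{\sum (X_{i2} - \bar{X}_2)(Y_i - \bar{Y})}{\sum (X_{i2} - \bar{X}_2)^2}.$$

These equations yield the same value for b as would be obtained if
individual univariate regressions were run. In general we observe that
the least squares solution of the p variable multiple regression situation
will yield the same b values as p individual univariate regressions when all
variables have zero covariances. (This will not be true for $b_0$; why
not?) When the X variables are not strictly statistically independent
(i.e., when the covariances of the X variables do not equal zero) the
multiple regression solution and the univariate regression solutions
will differ.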
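The two conclusions of this handout can be checked numerically. The sketch below codes the $b_1$ and $b_2$ formulas directly from variances and covariances (with divisor N, as in the handout). The function names and the tiny data sets are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    """Covariance with divisor N, as defined in the handout."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / len(u)

def two_variable_slopes(x1, x2, y):
    """b1 and b2 from the variance and covariance formulas."""
    var1, var2 = cov(x1, x1), cov(x2, x2)
    c12 = cov(x1, x2)
    denominator = var1 * var2 - c12 ** 2
    if denominator == 0:
        raise ValueError("X1 and X2 are exactly collinear; no unique solution")
    b1 = (var2 * cov(y, x1) - c12 * cov(y, x2)) / denominator
    b2 = (var1 * cov(y, x2) - c12 * cov(y, x1)) / denominator
    return b1, b2

def univariate_slope(x, y):
    """Slope from a one-variable regression of y on x."""
    return cov(y, x) / cov(x, x)

# Hypothetical data: x1 and x2_uncorrelated have zero covariance.
x1 = [-1, -1, 1, 1]
x2_uncorrelated = [-1, 1, -1, 1]
x2_correlated = [-1, 0, 0, 1]
y = [1, 2, 4, 9]

b1, b2 = two_variable_slopes(x1, x2_uncorrelated, y)
print("zero covariance:   ", b1, b2,
      "| univariate:", univariate_slope(x1, y), univariate_slope(x2_uncorrelated, y))

c1, c2 = two_variable_slopes(x1, x2_correlated, y)
print("nonzero covariance:", c1, c2,
      "| univariate:", univariate_slope(x1, y), univariate_slope(x2_correlated, y))
```

With zero covariance the multiple regression slopes match the univariate slopes; with correlated X variables they do not. Passing the same list for both X variables raises the error, which is the "no solution" case described above.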

What to look for in reading technical reports


General

What is the problem being addressed?
Is it a substantive or methodological issue?
Is it a basic or secondary issue?
Is it an applied issue, a theoretical issue, or a combination?
Is it part of an established research tradition or does it stand alone?
Can you see any relevance for your concerns?


Data

What are the data?
Where do they come from?
How were they gathered?
Are there "data problems" (missing values, poor sampling, poorly defined
measures, etc.)?
Are the data relevant to the problem addressed or are there better sources
of information on the topic?
Are the data available if you wanted to pursue the analysis?


Method

What procedure(s) was used?
Is the analytic procedure appropriate to the data?
Will it speak to the problem addressed?
Do you understand it?
Have you used it before or is it new?
Is its application here novel and innovative or typical and expected?
Are there other procedures which could have been used?
Did exploration precede confirmation?
If not, do you believe the analysis?
Can the method be applied to other areas?


Results

What are they?
Have you learned anything?
Is it important?
Did you believe it?
Is it relevant to your own concerns?
Do you believe the results are robust and will hold outside of the limited
context of this study, i.e., can they be generalized?


Miscellaneous

Was it worth the effort?
Is it a landmark study useful for citation in other contexts or
better forgotten?
Has the author(s) published anything else that you might follow up?
Are the references useful?
Was the presentation adequate and convincing?
Can the data be mined for other issues?
Are there any outstanding problems?
Were all the questions raised dealt with?
What are the directions for future research?
Some Principles of Graphics for Scatterplots

This handout is a continuation of the Module I handout concerning
some standards for graphics. The earlier handout was concerned with
tables and charts; this handout focuses on scatterplots, or graphic
displays of (X,Y) paired observational data sets. As before, some of these
principles are due to Edward R. Tufte.
The principles discussed here are:

(1) Less is more;

(2) Suppressing the frame and grid;

(3) Pay attention to details;

(4) Friendly lettering and other aesthetic considerations;

(5) Using parallel plots.

We intersperse our text with many examples, both good and bad, taken
from the pages of the New York Times, Business Week, and several scientific
journals. As mentioned in the earlier handout, we feel that paying
attention to the details expounded upon in these pages will help you
produce good displays.
Principle 1: Less is More

"Less    is  more"  is  the  first  principle  in  both  this  handout  and  the 
earlier  graphics  handout.     It  is  also  the  most  important.    Why  should  we 
waste  pages  and  pages  of  text  when  a  simple  and  explicit  graphical  dis- 
play suffices? 

Scatterplots,  especially  of  time  series  data,  are  used  in  reports 
and  articles  to  a  greater  extent  today  than  ever  before.     They  increase 
in  value  as  more  authors  recognize  their  usefulness  in  presenting  and 
summarizing the relationship between two quantifiable variables.

As an illustrative example, consider the scatterplot shown in
Figure 1.

Tufte plotted the work load of the Publications Distribution
Service for the U.S. House of Representatives as a time series. This
display has an immediate impact on the reader. There are dramatic peaks
in the data every second October, right before Election Day!

The New York Times published a very involved, 700 word article
to describe this phenomenon (N.Y. Times, June, 1975, p. 28). A display,
such as that prepared by Tufte, could certainly have shortened the
article and enhanced reader enjoyment.
FIGURE 1

U.S. HOUSE OF REPRESENTATIVES: PUBLICATIONS DISTRIBUTION SERVICE,
MILLIONS OF WORK-UNITS PER MONTH, 1966-1972
Principle 2: Suppressing the Frame and Grid

Considerations when making a scatterplot were discussed in the
Prerequisite Inventory, Module I, and Chapter 4 of Hoaglin's forthcoming
text A First Course in Data Analysis. We want to emphasize that final
versions of scatterplots should not contain the graph paper grid, nor
should the frame of the paper be included.

Consider Figure 2, a graphic that violates this principle. It is
difficult to find the points in the grid. Figure 3 shows the same plot,
first with grid (and quite a few of the points) suppressed, and then
second, underneath the first, with grid and frame erased. The first
plot greatly simplifies the relationship between X and Y by summarizing
it with a line (where is the equation?) and a few token points (perhaps
too few). The second plot is slightly better. Comments are Tufte's.


A  final  note:  These  displays  show  the  relationship  between  actual 
registration rates and predicted rates. Isn't this relationship one of
"actual data values" to "fitted data values"? How can we better analyze
these data, using techniques introduced in class?
FIGURE 2

Relationship of Actual Rates of Registration to Predicted Rates
(104 cities, 1960).

Source: Stanley Kelley, Jr., Richard E. Ayres, and William G. Bowen,
"Registration and Voting: Putting First Things First," American Political
Science Review, 61 (June, 1967), p. 371.


FIGURE 3

ACTUAL AND PREDICTED SHARE OF THE
TWO-PARTY VOTE RECEIVED BY
CONGRESSIONAL CANDIDATES OF
PRESIDENT'S PARTY

[Vertical axis: actual share of the vote received by House candidates,
President's party. Plot not reproduced.]

Source: Edward R. Tufte,
"Determinants of the Outcome
of Midterm Congressional Elections,"
American Political Science Review,
69 (September, 1975), p. 819.

[Annotation on the second plot:] Extra digits not needed; ".0" should
be deleted from each number.

Source: Stanley Kelley, Jr., Richard E. Ayres, and William G. Bowen,
"Registration and Voting: Putting First Things First," American Political
Science Review, 61 (June, 1967); figure from reprinted version in Edward
R. Tufte, ed., The Quantitative Analysis of Social Problems (Reading,
Massachusetts: Addison-Wesley, 1970), p. 267.
Principle 3: Pay Attention to Details

By giving the details of a scatterplot proper attention, a good
display can be made even better. If the graphic can be reproduced
in color, then by all means, make the points black, the scale green,
and the fitted line red. Choose a good symbol to use for plotting
points. We prefer X over . because the former symbol is larger. If
you have different types of points, use different symbols to highlight
the differences.

Consider Figure 5 from an article published in Science. The
authors are analyzing sex discrimination in graduate school admissions
at Berkeley. They have plotted the percent women admitted (Y) versus
percent women accepted (X), one "box" per department. Note that the
size of the boxes is related to the total number of applicants to the
department. Here are some critical comments:

1. No detailed scale for size of box given. Why not label the
largest boxes, and give total number of applicants?

2. Why is the minimum box size < 40?

3. Fitted line does not fit the data!

4. Lettering is of poor quality.

5. Could we improve on the use of the boxes?
FIGURE 5

[Plot not reproduced. Legend: number of applicants ≤ 40. Axis label:
percent women applicants.]

Proportion of applicants that are women plotted against proportion of
applicants admitted, in 85 departments. Size of box indicates relative
number of applicants to the department.


Source: P. J. Bickel, E. A. Hammel, and J. O'Connell, "Sex Bias in Graduate
Admissions: Data From Berkeley," Science, 187 (February 7, 1975), p. 400.


Principle 4: Friendly Lettering, etc.

This principle is briefly stated: pay attention to the "aesthetic"
details. "Friendly lettering" is a good example. Instead of typing
comments on the display, letter them by hand. If necessary, let a
professional do it. Graphics by Roger Hayward are good examples of
friendly plots, as shown in the plots of Figure 6, taken from a Chemistry
text. Hayward is best known for his graphics work found in the "Amateur
Scientist" section of Scientific American.
Principle 5: Using Parallel Plots

Parallel  plots  are  useful  in  summarizing  complex  data  sets.  The 
graphs  in  Figure  7  are  from  Business  Week  and  are  quite  attractive. 
Business Week usually produces very good displays of economic time
series data.

Module III of the Quantitative Methods for Public Management package
contains three units, numbers 5, 6 and 7. Unit 5, Probability and
Sampling, introduces the student to the notions of probability and random
variables. The relative frequency approach to probability is emphasized
and a sampling experiment is used to provide a concrete, empirical feel
for this fundamental idea. Following the introduction of probability
notions, the concept of a random variable, a variable which takes on
values with associated probabilities, is introduced. Distributions for
selected random variables are then discussed with special emphasis placed
on those random variables traditionally associated with linear models. The
shape of distributions is illustrated using the graphics tools of Module I.
Distribution is felt to be important to future public managers because
many policy relevant analytic situations require knowledge of the
characteristic shape and moments of a variety of well known random variables.
Simply computing probabilities of events, such as the probability of
finding a physician in a census tract when the physician distribution
is known to be Poisson, requires knowledge of how the random variable
behaves. The distributions introduced in Unit 5 cover those most likely
to be encountered in the field. Furthermore, exposure to these selected
distributions prepares the student to use others as the occasion arises
in the computation of moments and the performance of such inferential
procedures as constructing confidence intervals and hypothesis tests.
Numerous examples in presentation material, homework, and exams in
Module III illustrate the everyday utility of applied distribution theory.
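As an illustration of the Poisson example mentioned above, the following sketch computes the probability of finding at least one physician in a census tract. The rate of 2 physicians per tract is a made-up value for illustration, not a figure from the course data, and the function name is hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for a Poisson random variable with the given mean."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

# Assumption: physicians per tract follow a Poisson distribution with
# a hypothetical mean of 2 physicians per tract.
rate = 2.0

p_none = poisson_pmf(0, rate)
p_at_least_one = 1 - p_none

print(f"P(no physician in tract)      = {p_none:.4f}")
print(f"P(at least one physician)     = {p_at_least_one:.4f}")
```

The same function gives the probability of any particular count, and for a Poisson variable the first moment and the variance both equal the rate.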
Unit 6, Inference, introduces the student to the use of probability
notions in determining the precision of parameter estimates and testing
hypotheses. The material covered here is traditional statistical inference.
However, students are cautioned against blind use of these procedures
through careful consideration of the stringent assumptions implicit
in the approach.

Unit 7, Sample Surveys, departs from the usual material taught in
QMPM in that it concerns data collection rather than data analysis. The
unit introduces the student to the use of surveys, the design of questions,
questionnaire layout, fielding procedures and sampling designs. Emphasis
is placed on the utility of surveys in policy analysis and their
shortcomings. The objective of the unit is to create intelligent consumers
of survey reports rather than skilled survey researchers. Since this is
a vast area covered in only three lectures, it is implicitly assumed that
when the student becomes a practitioner and has need for a survey a
professional organization with experience in conducting survey research will
be retained.

Specific  Objectives 
Unit 5

Upon successful completion of Unit 5 a student will acquire an
understanding of what is meant by the mathematical notion of probability
and will be able to use this notion in the study of random variables and
their applications. The student will be able to specify the distributions
of selected continuous and discrete random variables including the Normal,
rectangular, exponential, uniform, binomial and Poisson. The student
will also be able to compute first and second moments for random variables
with these distributions and to recognize when empirical data are likely
to be observations on random variables with these distributions. The
student will also be familiar with the t, χ², and F distributions, how
they are related to each other and to the Normal, and how they arise
in a linear model estimated from sample data.

Unit 6

Upon successful completion of Unit 6 a student will be able to apply
probability notions in the performance of statistical inferential
procedures. The student will be able to apply knowledge of probability
distributions and moments to compute confidence intervals and confidence
levels using known random variables. The student will also be able to
construct hypothesis tests and specify significance levels for tests.
The student will have learned to perform these operations on single
parameters such as the mean of a sample and on coefficients estimated in
a multiple regression equation.

Unit 7

Upon successful completion of Unit 7 a student will be able to recognize
a situation requiring the use of a sample survey and to design and
field a simple survey instrument. The student will also have developed
a critical capacity permitting effective review of survey instruments
and results and will be able to compute elementary statistics to estimate
precision in the case of simple random sampling. The student will also
be able to identify features of more complicated probability sampling
procedures such as cluster, stratified, systematic, or multistage
sampling and will be able to assess their advantages and disadvantages
in particular situations. The student will also be able to assess the
benefits and disadvantages of various fielding methods such as face-to-face,
telephone, and mailed response interviews.
Unit 5

Reading Assignments

Lecture        Assignment

Lecture 5-0    Mosteller, Rourke, and Thomas, Chapter 3
               Mueller, Schuessler, and Costner, Chapter 11

Lecture 5-1    Tufte, Chapter 2

Lecture 5-2    Mosteller, Rourke, and Thomas, Chapters 5 and 7

Lecture 5-3    Draper and Smith, Chapter 2


In addition, read the following articles in Tanur, et al.:

pages 102-11
      164-75
      212-19
      244-52
      372-84
      407-15


Texts:

Draper, N. and H. Smith, Applied Regression Analysis, New York:
John Wiley & Sons, 1966.

Mosteller, F., R. Rourke, and G. Thomas, Probability with Statistical
Applications, Second Edition, Reading, Massachusetts: Addison-Wesley,
1970.

Mueller, J., K. Schuessler, and H. Costner, Statistical Reasoning
in Sociology, Third Edition, Boston: Houghton Mifflin,
1977.

Tanur, J., et al., Statistics: A Guide to the Unknown, San Francisco:
Holden-Day, 1972.

Tufte, E. R., Data Analysis for Politics and Policy, Englewood Cliffs,
New Jersey: Prentice-Hall, 1974.
Prerequisite Inventory
Module III

This  prerequisite  inventory  contains  a  brief  introduction  to  the 
vocabulary and notation of elementary probability and set theory. If
you  are  still  uncertain  about  any  of  these  concepts  after  reading  the 
inventory,  please  consult  a  member  of  the  teaching  staff. 

Probability is a measure of chance. Discussions of chance are by
no means limited to the classroom. The weatherman often states that
there is a 60% chance of rain, or a friend might remark that "chances
are I'll be home late again tonight." Or, someone else may state that
it is likely that federal income tax will rise this year.

In spite of our familiarity with chance, it is difficult to come
up with a rigorous definition of probability. Mathematicians and
philosophers cannot agree on a single definition. One group of
statisticians believes in objective probability. Objective probabilities are
derived from repeated observations of the happening that is in question.
The long-run relative frequency with which the happening occurs is taken
to be its probability. For example, suppose you work for a health
agency which is testing the effects of red dye #2, a suspected carcinogen,
on rats. You have tested red dye #2 on 3000 identical rats. Five
hundred of those rats have developed cancer. You conclude that the
probability that the next rat to receive red dye #2 will develop cancer
is 500/3000, or 1/6.

Probabilities may be based on equally likely chances. The traditional
examples of this type of probability are the flip of a coin and
the roll of a die. When a coin is tossed, heads and tails are equally
likely; each has probability 1/2. The sides of a die are equally likely,
so the probability that a particular side will appear is 1/6. Suppose
that we want to meet with some physicians in Buffalo to discuss their
opinions on malpractice insurance. We know that physicians are distributed
unevenly across census tracts. Let's assume that there are 1000
physicians, of which only 1 is in the first census tract. If we select
the first physician that we meet with at random, then the probability
of selecting each particular physician is equal, namely a 1 in 1000
chance, a probability of .001. With only one physician in the first
census tract, we can also say that the probability that our first
physician is from that tract is .001. If there are 18 physicians in
census tract two, then the probability that our first physician is from
that tract is .018.

Another definition of probability is subjective or personal
probability. Subjective probabilities are based on personal belief or
professional judgment. Not all statisticians accept the idea of subjective
probability, but some are willing to assign measures of chance to
happenings that cannot be repeated to obtain a long-run frequency. In
1977, the secession of Nantucket and Martha's Vineyard from Massachusetts
seemed possible. We cannot repeatedly put Martha's Vineyard and Nantucket
in a position to secede, and count the number of times that they actually
do secede. Nor is secession one of a number of equally likely events.
It may be possible, however, for us to analyze the existing conditions
which affect the secession decision and to state our opinion that there
is a 20% chance that the islands will secede within the next ten years.
This 20% is an example of a subjective probability.

Statisticians do have a precise language for describing the happenings
to which probabilities are assigned. An experiment is an act that can
be repeated under given conditions. In the examples above, testing red
dye #2 on a rat, flipping a coin, rolling a die, and selecting a physician
to interview are all experiments. Secession from Massachusetts
is difficult to repeat and is therefore not an experiment.

An experiment has one or more possible outcomes. We will rarely
be interested in experiments with one outcome, since that outcome occurs
with probability 1 (certainty). The outcome of flipping a coin must be
either heads or tails. The outcome of giving red dye #2 to a single
rat must be either cancer or not cancer. The outcome of randomly
selecting a physician from 1000 must be one of the 1000 physicians.

An elementary event is the outcome of an experiment. We will often
be interested in more complex events which incorporate more than one
outcome. These are called compound events. In the red dye #2 example,
we do not care whether or not a particular rat develops cancer. Rather,
we are interested in whether or not a sufficient number of rats develop
cancer so that we can conclude that red dye #2 is harmful. Suppose we
have decided to conclude it is harmful if more than half of the rats
get cancer. This event, more than 1/2 of the rats developing cancer, is
a combination of the elementary events that 1501 or 1502 or 1503, and
so on up to 3000, rats develop cancer.

A set is a collection of numbers or objects that can be grouped
together in some context. We may talk about the set of all black city
managers or the set of lifetimes of General Motors automobiles. The
elements of a set are listed between brackets { } and are separated
by commas. A set of lifetimes of automobiles in months may look like

{68, 73, 79, 80, 83, 83, 91}.

The possible outcomes of an experiment are often listed in set
notation. When this is done, the set S is called the sample space of the
experiment. The sample space of the experiment of testing red dye #2
on one rat is

{cancer, not cancer}.

The sample space of the experiment of testing the effects of red dye #2
on 3000 rats is

{no rats get cancer, 1 rat gets cancer, 2 rats get
cancer, 3 rats get cancer, ..., 3000 rats get cancer}.

In large sets containing an obvious progression, we use ... to stand
for "and so forth up to" followed by the final element in the set. The
set

{1, 2, 3, ...}

is the set of positive integers and is of infinite length. (There is no
"final" element.)

A subset of a set S is made up only of elements in S. Subsets may
contain all of the elements in S or any number of elements less than the
total number. The set which contains no elements is called the empty
set and is symbolized by { } or ∅. The empty set is a subset of every set.
An event (see above) is a subset of a sample space.
We may want to describe the union or the intersection of two or
more sets. The intersection of sets consists of the elements which the
sets have in common. The symbol for intersection is ∩. For example,
consider A and B where A is the set of grades on a test in a classroom
with computer-aided instruction and B is the set of grades on a similar
test in a traditional classroom:

A: {80, 83, 86, 86, 89, 93}
B: {62, 70, 79, 83, 89, 90}

Then A ∩ B = {83, 89}

The union of sets, symbolized ∪, enumerates all of the elements
that appear in one or more of the sets. Referring to A and B above,

A ∪ B = {62, 70, 79, 80, 83, 83, 86, 86, 89, 89, 90, 93}

We say that two sets, C and D, are mutually exclusive if they have
no elements in common. This is equivalent to the mathematical statement
C ∩ D = { }. The set of U.S. Senators and the set of U.S. Representatives
at one point in time are mutually exclusive sets.

Suppose that E is some event which is a subset of the sample space
S. Then the set Ē, consisting of all elements in S which are not in E,
is called the complement of E. Complements have the properties that

     E ∩ Ē = { }
and  E ∪ Ē = S, the sample space.
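
These set operations map directly onto Python's built-in set type. The sketch below is only an illustration: the die sample space and the variable names are assumptions chosen for the example, and because a Python set stores each distinct value once, a repeated grade appears only once in the output.

```python
# Sample space for one roll of a die, and the event "an even number".
S = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}

# Complement of E relative to S: every element of S that is not in E.
E_complement = S - E

# Distinct grades from the two classrooms in the text.
A = {80, 83, 86, 89, 93}
B = {62, 70, 79, 83, 89, 90}

print(sorted(A & B))              # intersection: grades common to both
print(sorted(A | B))              # union: each distinct grade listed once
print(E & E_complement == set())  # E and its complement share no elements
print(E | E_complement == S)      # together they make up the sample space
```

Two sets are mutually exclusive exactly when their intersection is the empty set, which is what the third line checks for E and its complement.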

We will be discussing the probability that a particular event
occurs. The probability of an elementary event is the likelihood that
the experiment will result in that outcome. The probability associated
with an entire sample space is 1.



You will see different notations used to denote the probability of
an event. The most common are P{event}, Pr{event}, P(event), and
Pr(event). When we roll one die, P{5} = 1/6 and P{an even number} = 1/2.
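
When every outcome in the sample space is equally likely, the probability of an event is the number of outcomes in the event divided by the number of outcomes in the sample space. The following is a minimal sketch of that rule; the function name `prob` is hypothetical and is introduced here only for illustration.

```python
from fractions import Fraction

def prob(event, sample_space):
    """Probability of an event when all outcomes are equally likely."""
    event = set(event) & set(sample_space)   # ignore outcomes outside S
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}
p_five = prob({5}, die)         # P{5}
p_even = prob({2, 4, 6}, die)   # P{an even number}
p_sure = prob(die, die)         # probability of the entire sample space

print(p_five, p_even, p_sure)
```

Exact fractions are used instead of decimals so that the results can be compared directly with the values 1/6 and 1/2 quoted in the text.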

A population is a group of people (or things) specified by some
characteristic. A population may be all people in the United States,
35-year-old congresswomen from Poughkeepsie, or men with income greater
than $50,000. A sample is a subset of an entire population.

In Module III, we will use our knowledge of probability to generalize
from samples to populations. Probability will enable us to quantify the
uncertainty about the population which is due to our sampling, i.e., our
observation of only part of the set consisting of all members of the
population. Unless we analyze the entire population, there will always
be uncertainty.

A sample is a batch of data, and we continue to use measures of
location and scale that were discussed in Module I. In particular, the
sample mean, X̄, and the sample standard deviation, s, will be used. It
is a useful property of sampling that as the size of a sample increases,
both in absolute number and relative to the population size, the sample
mean approaches the mean of the population.
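
The behavior of the sample mean can be explored with a small simulation. This is a sketch under stated assumptions: the population of 10,000 made-up incomes, the seed, and the function name `sample_mean` are all hypothetical and are not taken from the course materials.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 made-up incomes (thousands of dollars).
population = [random.uniform(5, 60) for _ in range(10_000)]
population_mean = statistics.mean(population)

def sample_mean(n):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.mean(random.sample(population, n))

for n in (10, 100, 1_000, 10_000):
    error = abs(sample_mean(n) - population_mean)
    print(f"n = {n:>6}: |sample mean - population mean| = {error:.4f}")
```

Any single run can show a larger error at a larger n by chance, but the error tends to shrink as n grows, and it vanishes when the sample is the whole population.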

A final concept with which to be familiar before proceeding to
Module III is the difference between continuous and discrete. A discrete
variable may take on one of a finite or countably infinite set of values.
The number of black city managers and the set of positive integers are
examples of discrete variables. A continuous variable may take on values
from a set consisting of an interval of the real number line. Length of
a particular road, average height of European men, and all numbers between
0 and 1 are examples of continuous variables.
Homework
Prerequisite Inventory, Module III

Questions 1-10 refer to sets A through D.

A: {0, 1, 2}

B: {1, 3, 5, 7, 9}

C: {2, 4, 6, 8, 10}

D: {1, 2, 3, 4, 5}

List the elements of the following sets (in set notation):

1.  A ∪ D

2.  A ∩ D

3.  B ∩ D

4.  B ∩ C

5.  B ∪ C

6.  (A ∪ D) ∩ (A ∪ C)

7.  (A ∩ D) ∪ (A ∩ C)

8.

9.  A ∩ B ∩ D

10. (B ∩ D) ∪ (C ∪ A)

In questions 11-15, give the sample space of the described experiment.

11. Annual family income for a family in Detroit, given that the head
of the household is chairman of the board of a major automobile
manufacturing company.

12. Lifetimes (in years) of individuals in Washington D.C. who died
between 1960 and 1970 from a heart attack.

13. The number of cars passing a building in a one hour period.

14. Percentage of black students in your master's class.

15. The number of women in a room of 10 people.
Homework Solutions
Prerequisite Inventory, Module III

1.  {0, 1, 2, 3, 4, 5}

2.  {1, 2}

3.  {1, 3, 5}

4.  { }

5.  {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

6.  {0, 1, 2, 4}

7.  {1, 2}

8.  {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

9.  {1}

10. {0, 1, 2, 3, 4, 5, 6, 8, 10}

11. Set contains 3 incomes (one for GM, Chrysler, Ford), each element
in excess of $200,000

12. Set contains positive integers, between 20 and 90 (approximately)
with most elements between 40 and 60.

13. {0, 1, 2, ...}

14. If class has N students, set is {0%, 1/N x 100%, 2/N x 100%, ...,
(N-1)/N x 100%, 100%}

15. {0, 1, 2, 3, ..., 10}
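
The set questions can be checked mechanically with Python sets. This is a sketch only: the expressions for questions 4 through 7 and 10 are a reading of operators that are partly illegible in the source, and question 8 is left out because its expression does not survive.

```python
A = {0, 1, 2}
B = {1, 3, 5, 7, 9}
C = {2, 4, 6, 8, 10}
D = {1, 2, 3, 4, 5}

# "|" is union and "&" is intersection.
answers = {
    1: A | D,
    2: A & D,
    3: B & D,
    4: B & C,
    5: B | C,
    6: (A | D) & (A | C),
    7: (A & D) | (A & C),
    9: A & B & D,
    10: (B & D) | (C | A),
}

for number, result in answers.items():
    print(number, sorted(result))
```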
Lecture 5-0. Introduction to Unit 5

Introduction to Unit 5, Probability and Sampling

Lecture Content:

1. Introduction to objectives, problem, and notation for Unit 5

2. Discussion of a quantification of the notion of uncertainty

Main Topics:

1. Specific Introduction to the Objectives of Unit 5

2. Presentation of general problem of Unit 5

3. Notation for Unit 5

4. Definition of Probability
Topic 1. Specific Introduction to the Objectives of Unit 5

I. Questions to be answered in Unit 5

    1. What is probability?

        a. In a frequency context, the probability of an event
           is the proportion of times that the event occurs in a
           large number of trials; e.g., P{head on a coin toss}

        b. However, probability is also a subjective notion and
           may vary from person to person: chance, uncertainty,
           probable, likelihood of occurrence

    2. How do we calculate and manipulate probabilities?

        a. We define simple rules for finding probabilities of
           equally likely outcomes

        b. Intersections and unions of events are easily visualized
           via Venn diagrams, and the corresponding probabilities
           found

    3. What is a random variable, and how do we determine and
       utilize its probability distribution?

        a. A random variable is a variable whose specific value
           is not known with certainty

        b. A random variable is characterized by its probability
           distribution, which gives probabilities that the variable
           will have certain values.

        c. The probability distribution allows us to make certain
           statements about the random variable. For example, we
           may calculate the average value of the random variable.

    4. What are some examples of random variables? [Random
       variables (usually) are either discrete or continuous.]

        a. Discrete Random Variables

            i. Discrete random variables take on a finite or
               countably infinite number of values; e.g.,
               number of customers arriving at a supermarket
               between noon and 1 PM on a given day (Poisson)

            ii. Some discrete random variables include
                Binomial
                Poisson
                Uniform

        b. Continuous Random Variables

            i. Continuous random variables take on any value in
               some interval of the real line; e.g., heights of
               individuals (Gaussian)

            ii. Some continuous random variables include
                Exponential
                Rectangular
                Gaussian

    5. How do we apply probability theory in data analysis?

       We discuss regression analysis with probabilistic
       assumptions for errors

II. Skills to be mastered in Unit 5 (1)

    1. Quantification of uncertainty through the concept of
       probability

    2. Identification of the mathematical formulae of various
       sampling or probability distributions

    3. Recognition of several types of random variables, and
       the ability to compute their expectations

    4. Assessment of goodness of fit in multiple regression by
       utilization of probability.
Topic 2. Introduction to the Problems of Unit 5

I. What is probability?

    1. Experiment: An activity or procedure involving alternative
       outcomes. Each has an associated probability. Repeatable
       under given conditions.
       Example: Flip a fair coin 10 times and let X = number of
       heads.

    2. Each outcome is an event, either elementary or a combination
       of elementary events. The set of all possible outcomes for an
       experiment is the sample space.

    3. State that, based on these 10 trials, Pr{head} = x/10

    4. Note that this definition of probability is based on an
       infinite number of trials; only as n → ∞ will x/n → 1/2
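
The long-run behavior of x/n can be illustrated with simulated coin tosses. This is a sketch only: the seed, the numbers of tosses, and the function name are assumptions made for the illustration, and a pseudo-random generator stands in for a physical coin.

```python
import random

random.seed(5)  # fixed seed so the run is repeatable

def relative_frequency_of_heads(n):
    """Toss a simulated fair coin n times and return x/n, the share of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))
```

With only 10 tosses the estimate x/10 can be far from 1/2; the estimate settles near 1/2 only for large n.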


II. How do we calculate probabilities?

    1. Let A be an event

       Pr{A} must be between 0 and 1, inclusive.

       Intuitive: 0 means event doesn't occur; 1 means event does.

    2. If the universe (the set of all possible outcomes) contains the
       events A_1, A_2, ..., A_n, then Σ Pr{A_i} = 1.

    3. By using Venn diagrams we can compute probabilities of         (2)
       intersections, unions, etc. Intersection is "and", Union is
       "or"

III. What are examples of random variables?                          (3)

    1. Number of heads in n tosses of a coin = X

       X is a Binomial random variable

    2. Number of arrivals of airplanes at an airport in a specific
       hour = X

       X is a Poisson random variable

    3. Number obtained on a single roll of a die = X (either
       1, 2, 3, 4, 5, or 6)

       X is a Uniform random variable

    4. Waiting time between arrivals of customers in a supermarket
       = Y

       Y is an Exponential random variable

    5. Random variables occurring in nature, e.g., intelligence
       scores, lengths of rose petals, are (usually) assumed to be
       Gaussian.

Conclusion

    1. Need methods to recognize which distribution a batch of
       data came from

    2. Need to be well versed in mathematical probability
Topic 3. Introduction to the Notation of Unit 5

I. We let capital letters such as X, Y, Z, ... denote random
   variables

II. Small letters x, y, z, ... denote realizations of these
    variables; e.g., we write Pr{X = x}, where x is an element
    of the sample space of X.

    1. Read as "the probability that X, the random variable,
       will take the value x."

    2. "Realization" indicates a value actually taken by a
       random variable
Topic 4. Probability

I. In this section, we define some basic notions of probability.

    1. Probability is defined for events, or occurrences of a
       certain phenomenon. Events are notated A, B, ...

        a. A head on a single coin toss is an event

        b. 3 heads in 10 coin tosses is an event

        c. A leaf density of 6.93 is an event. Note P{A} between
           0 and 1

    2. The collection of all possible events, relative to a
       specific experiment, is called the sample space or
       universe, S = {A_1, A_2, ..., A_n}. It depends on the
       definition of the experiment.

        a. Events are subsets of S

        b. If we toss a coin 20 times and record the number of
           heads, then S = {0, 1, 2, ..., 19, 20} heads

        c. Σ P{A_i} = 1

    3. Probability of an event A is the number of times the
       event occurs (or the number of successes) divided by
       the total number of trials, for a large number of
       trials:

           P{A} = (number of "successes" of A) / (total number of outcomes)

       i.e., it is a relative frequency.

    4. Suppose we have 2 events A and B. The event C = A ∪ B is
       the union of A and B, and is the occurrence of either
       A or B or both A and B

    5. Suppose we have 2 events D and E. The event F = D ∩ E is
       the intersection of D and E, and is the occurrence of both
       D and E.

    6. P{C} = P{A ∪ B} = P{A} + P{B} - P{A ∩ B}. If A and B are
       disjoint, then P{A ∪ B} = P{A} + P{B}

    7. Ā is the complement of A

       P{Ā} = 1 - P{A}

    8. If 2 events are independent, or have nothing to do with
       each other, then P{A ∩ B} = P{A}P{B}
    9. In general, when 2 events are not independent,

       P{A ∩ B} = P{A|B}P{B}

       where P{A|B} is the conditional probability, read A
       given B.

    10. Note that P{A|B} = P{A ∩ B}/P{B}

II. Discuss example (4-5)
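
The rules in this topic can be verified by enumeration on a small sample space. The sketch below is illustrative and is not the example used in the lecture: the two-dice experiment, the events A, B, and C, and the helper `P` are all assumptions introduced here.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered results of rolling two fair dice (36 equally
# likely outcomes).
S = set(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event (a subset of S) under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = {s for s in S if s[0] % 2 == 0}       # first die is even
B = {s for s in S if s[0] + s[1] >= 10}   # total is 10 or more
C = {s for s in S if s[1] == 6}           # second die shows 6

addition_rule_holds = P(A | B) == P(A) + P(B) - P(A & B)
complement_rule_holds = P(S - A) == 1 - P(A)
P_A_given_B = P(A & B) / P(B)             # conditional probability P{A|B}
A_and_C_independent = P(A & C) == P(A) * P(C)
A_and_B_independent = P(A & B) == P(A) * P(B)

print(addition_rule_holds, complement_rule_holds)
print(P_A_given_B, A_and_C_independent, A_and_B_independent)
```

Here A and C concern different dice and turn out to be independent, while A and B are not: knowing that the total is at least 10 changes the chance that the first die is even.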




Lecture 5-0
Transparency Presentation Guide

Lecture Outline              Transparency   Transparency Description
Location                     Number

Topic 1, Section II, 1.      1              Skills to be mastered

Topic 2, Section II, 3.      2              Venn Diagrams

         Section III, 1.     3              Examples of Random Variables

Topic 4, Section II.         4              Experiment to Introduce Probability

         Section II.         5              Probability Calculations

[Transparencies for Lecture 5-0: illegible in the source scan.]


Lecture 5-1. Sampling Distributions

Sampling Distributions: Notion of random variables introduced by means
of a sampling experiment

Lecture Content:

1. Discuss discrete and continuous random variables, numbers
   determined by the outcome of an experiment

2. Simple random sampling experiment from three probability
   distributions

Main Topics:

1. Random variables

2. Sampling Experiment

3. Sampling Distributions
Topic 1. Random variables

I. Basic Issue: Experimentally determined numbers

    1. Random variation arises in nearly all social and physical
       science experiments

        a. We may wish to measure the number of racial disorders
           occurring in inner city high schools in Boston

        b. Or we may wish to determine the boiling point of a
           certain chemical compound

        c. In both instances, the computed quantities will vary
           from school to school, or replication to replication

    2. It is important to quantitatively define the nature of
       variation in our experiments

    3. We describe this variation in probabilistic terms, to
       indicate our lack of certainty in the outcomes of the
       experiments

II. Problem: How do we characterize this variation?

    1. Probabilities are defined only for long-run frequencies
       of events, where the experiment has been replicated many,
       many times

    2. It is usually not profitable for us to conduct our
       experiments for policy decisions the required number of
       times

        a. Generally, we are lucky to have more than 100
           replications of an experiment, because of either lack of
           time and money, or the small size of the sampling
           space

        b. What limited inferences can we make on fewer than
           100 numbers?

        c. Stem-and-leaf displays indicate the nature of the
           variation, but rarely do they mimic the appearance
           of a known distribution.

    3. Example illustrates this problem                              (2)


III. Solution: Attempt to describe the random nature of the
     variable by one of the well-known probability models

    1. Essentially, we borrow strength from statistics, and
       assume that the variable in question follows a known
       probability model

    2. This approximation is reasonable in many instances;
       however, we must remember that it is just an approximation

    3. The stem-and-leaf display is our most powerful analytical
       tool in determining which model to assume

IV. Definitions                                                      (3)

    1. A random variable is a variable whose value is a number
       determined by the outcome of an experiment

    2. If X is a random variable, with possible values x_1, x_2, ...,
       x_n, and associated probabilities f(x_1), f(x_2), ..., f(x_n),
       then f is called the probability function of X.

    3. A random variable is like any other variable except that
       we know more about the random variable, namely the
       probabilities associated with its realizations

    4. Random variables are either discrete or continuous (or
       sometimes a combination of these two)

        a. Discrete random variables take one of a finite or
           countably infinite set of values

        b. Continuous random variables take any value from an
           interval of real numbers

    5. Random variables may also be vector-valued as in multiple
       regression

    6. In the next lecture we discuss some special random variables
       at length
Topic 2. Sampling Experiment

I. Basic Issue: How can we best learn about random variation?

    1. We can collect many data sets, all of which have a random
       nature

    2. However, it is more expedient to construct random numbers in
       a controlled "statistical laboratory"

    3. We sample from 3 distributions (Gaussian, Rectangular,
       Exponential) and study the nature of the variability of
       several familiar statistics

II. Problem: How do we perform this controlled experiment?

    1. Let: X be a random variable with a Gaussian distribution

            Y be a random variable with a rectangular distribution

            Z be a random variable with an exponential distribution

    2. X, Y, and Z are continuous random variables, with
       probability functions as shown                                (4)

    3. We shall draw 100 "samples" from each of these distributions,
       with sample size 20, by using a pseudo-random number
       generator

III. Solution: Study variation in our favorite statistics

    1. For each sample, from each distribution, we compute the
       sample mean X̄, the sample variance S², the median M, and
       the midspread

    2. Thus we have 100 sample means from

        a. Gaussian distribution

        b. Rectangular distribution

        c. Exponential distribution

       Similarly for sample variances, medians, midspreads

    3. We make a stem-and-leaf of each set of numbers and study
       the variation

    4. Questions to be answered:

        a. How much variability?

        b. Is variability of a statistic constant over distributions?

        c. Can we characterize the variability mathematically?

    Note: Discuss how this sampling experiment might arise outside
    of our "statistical laboratory"
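
The sampling experiment described above can be reproduced in outline with a pseudo-random number generator. This is a sketch under stated assumptions, not the procedure used to prepare the course transparencies: the distribution parameters (standard Gaussian, rectangular on (0, 1), exponential with mean 1), the seed, and all names are hypothetical, and the midspread is computed from quartiles returned by `statistics.quantiles`, which may differ slightly from a hinge-based midspread.

```python
import random
import statistics

random.seed(20)  # fixed seed so the run is repeatable
N_SAMPLES, SAMPLE_SIZE = 100, 20

# Assumed parameters for the three distributions.
generators = {
    "Gaussian": lambda: random.gauss(0, 1),
    "Rectangular": lambda: random.uniform(0, 1),
    "Exponential": lambda: random.expovariate(1.0),
}

def midspread(data):
    """Distance between the upper and lower quartiles of a batch."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

def run_experiment(draw):
    """Draw N_SAMPLES samples and record four statistics for each one."""
    results = {"mean": [], "variance": [], "median": [], "midspread": []}
    for _ in range(N_SAMPLES):
        sample = [draw() for _ in range(SAMPLE_SIZE)]
        results["mean"].append(statistics.mean(sample))
        results["variance"].append(statistics.variance(sample))
        results["median"].append(statistics.median(sample))
        results["midspread"].append(midspread(sample))
    return results

experiment = {name: run_experiment(draw) for name, draw in generators.items()}

for name, results in experiment.items():
    for stat, values in results.items():
        print(f"{name:12s} {stat:10s} center={statistics.mean(values):6.3f} "
              f"spread={statistics.stdev(values):6.3f}")
```

Each list of 100 values is an empirical sampling distribution. A stem-and-leaf display or a histogram of any one of them can then be examined for symmetry, skewness, and spread.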

IV. Experiment

    1. Sampling distribution of X̄, Ȳ, Z̄                         (5a), (5b)

        a. Note symmetry

        b. Batches appear quite well behaved, especially Gaussian

    2. Sampling distribution of medians                         (6a), (6b)

        a. Also symmetric, and, except for Exponential, well
           behaved

        b. Spread is larger

    3. Sampling distribution of S_x², S_y², S_z²                (7a), (7b)

        a. Note skewness, to the larger values

        b. Rectangular, not very varied

    4. Sampling distribution of midspreads                      (8a), (8b)

        a. Also skewed, but not as much as S²

        b. Less varied, except for rectangular

    5. This accords well with theory                                  (9)

        a. Sample mean: Gaussian by the important Central Limit
           Theorem as N → ∞

        b. Median: Gaussian, but with larger variance

        c. Sample variance: χ², a skewed distribution

        d. Midspread: Gaussian (!)



Lecture 5-1
Transparency Presentation Guide

Lecture Outline              Transparency   Description
Location                     Number

Beginning                    1              Lecture 5-1 Outline

Topic 1, Section II, 3.      2              Policy Question

         Section IV, 1.      3              Definitions

Topic 2, Section II, 2.      4              Some Continuous Random Variables

         Section IV, 1.      5a, 5b         Sampling Distributions of X̄, Ȳ, Z̄

                     2.      6a, 6b         Sampling Distributions of Median

                     3.      7a, 7b         Sampling Distribution of s_x², s_y², s_z²

                     4.      8a, 8b         Sampling Distribution of Midspread

                     5.      9              Theoretical Formulae for Sampling
                                            Distributions


[Transparencies for Lecture 5-1 (stem-and-leaf displays and summary values for the sampling distributions): illegible in the source scan.]



Lecture 5-2. Expectations of Random Variables

Expectations of random variables: Mathematical formulae for random
variables, and for means and variances of these variables

Lecture Content:                                                     (1)

1. Define Binomial, Poisson, and Uniform random variables, and
   the contexts in which they occur

2. Define Exponential, Rectangular, and Gaussian random variables,
   and the contexts in which they occur

3. Discuss mathematical expectation and independence of
   random variables

Main Topics:

1. Discrete random variables
2. Continuous random variables
3. Moments and other properties of these variables
Topic 1. Discrete Random Variables

I. Basic Issue: When can a batch of data be characterized by one
   of the special types of discrete random variables?

    1. Recall that discrete random variables take on only a
       finite or countably infinite number of values (one-to-one
       association with the integers)

        a. For example, X = number of Orientals in a census tract
           has sample space S = {0, 1, 2, ..., N}, where N =
           total population of tract

        b. Or Y = number of rides taken on PAT busses in 1976
           has S = {0, 1, 2, ...}, a countably infinite set of
           possible values

    2. We deal here with discrete random variables in general

        a. X = discrete random variable

        b. S = sample space of X = {x_1, x_2, ..., x_n}, where n may
           be infinitely large

        c. f(x_i) = P{X = x_i} is the probability function of X

        d. The x_i are called "mass points", and f a "probability
           mass function", since f gives positive probability
           or weight (= mass) to only the x_i. (f is also simply
           called a probability function)

    3. We shall discuss when X can be described with one of our
       special mass functions; i.e., when is X Binomial, Poisson,
       or Uniform


EKLC 


II.     Problem:    With  only  a  few  realizations  of  X,  what  can  we  say 
about  the  discrete  random  variable  ? 

1.  Occasionally  we  are  able  to  take  several  samples  (record 
several  observations)  of  X 

a.  A  stem-and-leaf  display  jshould  be  made  of  this  batchy 
and  the  shape  studied  quite  closely  and  compared  to  the 
shapes  shown  later  in  this  lecture 

b.  If  we  suspect  thatt^  X  is  either  Binomial,  Poisson,  or 
Uniform,  compute  X  and  S?  to  compare  with  the  "theore- 
tical" values  of  these  quantities 

2.  But  if  we  have  no  observations  on  X^  we  must  use  whatever 
knowledge  we  hav^si  available  about  X  to  characterize  its 
random  nature 


Sir; 


XVT.  TT.T,47 
III. Solution: Three special random variables to use when appropriate

1. Binomial random variable

a. Assume an experiment involves N trials or observations,
each trial being "independent", i.e. its outcome is
unaffected by the other N-1 trials

b. Assume each of the N trials has only 2 outcomes, a "0"
or "1" (which could stand for any pair of mutually
exclusive outcomes)

c. Let p = probability of an occurrence of a 1 on a single
trial (this does not vary from trial to trial)

d. Then X = number of 1's on N trials is a Binomial random
variable

e. Mass function $f(x) = \binom{N}{x} p^x (1-p)^{N-x}$, $x = 0, 1, \ldots, N$

2. Poisson random variable

a. (Rare events: $p \to 0$, approximation of binomial) Assume
a fixed interval of time or space

b. Consider a specific type of event or occurrence in
the interval

c. Let $\lambda$ (lambda) be the average (mean) number of events that
occur in the interval

d. Then X = number of events that occur in the interval is
a Poisson random variable (5)

e. Mass function $f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$, $x = 0, 1, 2, \ldots$

3. Uniform random variable

a. Assume an experiment with a sample space $S = \{x_1, \ldots, x_n\}$,
n finite

b. If each $x_i$ is equally likely to occur, then X is a
uniform random variable

c. Mass function $f(x_i) = \frac{1}{n}$

d. Random numbers are realizations of uniform random
variables. (7)
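
The three mass functions above can be evaluated directly. The following is a minimal sketch, not part of the original lecture; the function names are hypothetical and the parameter values (N = 10, p = 0.3, and so on) are chosen only for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n_trials, p):
    """P{X = x} for the number of 1's in n_trials independent 0/1 trials."""
    return comb(n_trials, x) * p**x * (1 - p) ** (n_trials - x)


def poisson_pmf(x, lam):
    """P{X = x} when lam is the mean number of events in the interval."""
    return lam**x * exp(-lam) / factorial(x)


def uniform_pmf(x, mass_points):
    """P{X = x} when every mass point is equally likely."""
    return 1 / len(mass_points) if x in mass_points else 0.0


if __name__ == "__main__":
    # Binomial probabilities over the whole sample space sum to 1.
    print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
    # Rare events: Binomial(N=1000, p=0.004) is close to Poisson(lambda = Np = 4).
    print(binomial_pmf(2, 1000, 0.004), poisson_pmf(2, 4))
    # A fair die is uniform on six mass points.
    print(uniform_pmf(3, [1, 2, 3, 4, 5, 6]))
```

The second print illustrates the remark under the Poisson heading: with many trials and a small p, the binomial and Poisson probabilities nearly coincide.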
Topic 2. Continuous Random Variables

I. Basic Issue: When can a random variable be continuous?

1. In general, all measurements are discrete: there is a
smallest possible fraction that we can measure.

2. However, the thing measured is theoretically continuous

3. Continuous random variables may have symmetric, skewed, or
even flat probability functions

a. Y = continuous random variable

b. S = sample space of Y = $\{y \mid a \le y \le b\}$, where a and b
are any real numbers

c. f(y) is called the density function of Y, since the
probability has been smeared over an interval (a,b),
and every smaller interval has a chunk of probability

II. Problem: When can we assume that a specific continuous random
variable can be characterized by one of our special density
functions?

1. We must use our intuition about the range of values of Y
and the shape of empirical realizations: foreknowledge and
much sensitivity

2. If we can prove that Y is either Gaussian, Rectangular,
or Exponential, then we have found a very important result

III. Solution: Specific continuous random variables

1. Gaussian (Normal)

a. The "well-behaved" distribution of single batches

b. Random variable is symmetric, and takes on values
between $-\infty$ and $\infty$

c. Important to notice the tails, and make sure that
they are not too fat.

d. There are many bell-shaped curves, e.g., Cauchy:
$f(y) = [\pi(1+y^2)]^{-1}$, thick tailed, no mean
(expectation), since the integral defining it doesn't converge

e. Density function

$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$, $-\infty < y < \infty$

$\mu$ (mean) $\in \mathbb{R}$; $\sigma$ (std. dev.) $> 0$.

2. Rectangular

a. Flat over an interval (a,b)

b. Density function

$f(y) = \frac{1}{b-a}$, $a \le y \le b$

3. Exponential (Waiting time)

a. Waiting times between randomly occurring events are commonly
modeled as exponentially distributed

b. Density function

$f(y) = \theta e^{-\theta y}$, $y \ge 0$

$\theta^{-1}$ = average (mean) waiting time, $\theta > 0$
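
As a rough numerical illustration of these densities (a sketch, not part of the original lecture), the code below evaluates each one and integrates with a simple midpoint rule. The function names and the `integrate` helper are hypothetical, and the interval limits are assumptions chosen so that the neglected tail mass is negligible.

```python
from math import exp, pi, sqrt


def gaussian_pdf(y, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    return exp(-((y - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def rectangular_pdf(y, a, b):
    """Density that is flat over the interval (a, b)."""
    return 1 / (b - a) if a <= y <= b else 0.0


def exponential_pdf(y, theta):
    """Waiting-time density; 1/theta is the mean waiting time."""
    return theta * exp(-theta * y) if y >= 0 else 0.0


def cauchy_pdf(y):
    """Bell-shaped but thick-tailed density with no mean."""
    return 1 / (pi * (1 + y * y))


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation to the integral of f over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


if __name__ == "__main__":
    # Each density integrates to (approximately) 1 over its range.
    print(integrate(gaussian_pdf, -10, 10))
    print(integrate(lambda y: rectangular_pdf(y, 2, 5), 2, 5))
    # Mean waiting time for theta = 0.1 should be close to 1/theta = 10.
    print(integrate(lambda y: y * exponential_pdf(y, 0.1), 0, 200))
    # Tail check: the Cauchy leaves far more probability outside (-3, 3).
    print(1 - integrate(gaussian_pdf, -3, 3), 1 - integrate(cauchy_pdf, -3, 3))
```

The last line makes the "fat tails" warning concrete: the two curves look alike near the center, yet differ greatly in how much probability lies far from it.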


Topic 3. Moments and other properties of these variables

I. Basic Issue: How can we summarize the random quality of these
variables?

1. What is the average, or typical, value of a random variable
(in terms of long term results of repeated experiments)?

2. What is the variance of a random variable?

3. The mean ($\mu$) and variance ($\sigma^2$) of a random variable X are
useful and easily computed summarizations of X

II. Problem: How do $\bar{X}$ and $S^2$ compare to $\mu$ and $\sigma^2$?

1. Suppose we have N sample observations on X and compute
$\bar{X}$ and $S^2$

2. Then $\bar{X}$ estimates $\mu$ and $S^2$ estimates $\sigma^2$

3. Only as $N \to \infty$ does $\bar{X} \to \mu$ and $S^2 \to \sigma^2$; i.e. only in very
large samples do the sample estimates come arbitrarily close to the
population values

4. Remember our sampling experiment and the way the sample
quantities varied about the true values
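
The behavior of the sample quantities can be seen in a small simulation. This is a sketch under stated assumptions, not the sampling experiment referred to in the lecture: it draws from a Binomial random variable with N = 10 trials and p = 0.3 (so $\mu = 3$ and $\sigma^2 = 2.1$), and the function names and seed are hypothetical.

```python
import random


def sample_mean_and_variance(values):
    """Return (X-bar, S^2), using the N-1 divisor for the sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def draw_binomial(rng, n_trials, p):
    """One realization: count the 1's in n_trials independent 0/1 trials."""
    return sum(1 for _ in range(n_trials) if rng.random() < p)


def experiment(sample_size, n_trials=10, p=0.3, seed=0):
    """Draw a sample of the given size and summarize it."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = [draw_binomial(rng, n_trials, p) for _ in range(sample_size)]
    return sample_mean_and_variance(sample)


if __name__ == "__main__":
    # Population values for comparison: mu = 3.0, sigma^2 = 2.1.
    for size in (10, 100, 1000, 20000):
        print(size, experiment(size))
```

Small samples typically wander noticeably around 3.0 and 2.1, while the largest sample lands close to both.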

III. Solution: How to compute $\mu$ and $\sigma^2$

1. $\mu = E(X)$, read "expected value of X", is the first moment
of the random variable

$\mu = \sum_{i=1}^{n} f(x_i)\,x_i$, if X is discrete, $S = \{x_1, x_2, \ldots, x_n\}$

$\mu = \int x f(x)\,dx$, if X is continuous

2. $\sigma^2 = E((X - \mu)^2)$, read "expected value of $(X - \mu)^2$",
is the second moment, about the mean, of the random
variable

$\sigma^2 = \sum_{i=1}^{n} f(x_i)(x_i - \mu)^2$, if X is discrete

$\sigma^2 = \int (x - \mu)^2 f(x)\,dx$, if X is continuous

3. Examples

a. Discrete
i. Binomial

$\mu = Np$, $\sigma^2 = Np(1-p)$

ii. Poisson

$\mu = \lambda$, $\sigma^2 = \lambda$

iii. Uniform

$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$

Note that these are not sample quantities

b. Continuous

i. Gaussian

$\mu = \mu$, $\sigma^2 = \sigma^2$

ii. Rectangular

$\mu = \frac{1}{2}(a+b)$, $\sigma^2 = \frac{(b-a)^2}{12}$

iii. Exponential

$\mu = \theta^{-1}$, $\sigma^2 = \theta^{-2}$
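
The discrete formulas can be checked by applying the definitions of the first moment and the second moment about the mean directly to a table of mass points. The sketch below is illustrative only; the helper names are hypothetical, and the Poisson table is truncated at an assumed cutoff beyond which the remaining probability is negligible.

```python
from math import comb, exp, factorial


def moments(pmf):
    """Return (mu, sigma^2) for a dict mapping mass point -> probability."""
    mu = sum(prob * x for x, prob in pmf.items())
    var = sum(prob * (x - mu) ** 2 for x, prob in pmf.items())
    return mu, var


def binomial_table(n_trials, p):
    return {x: comb(n_trials, x) * p**x * (1 - p) ** (n_trials - x)
            for x in range(n_trials + 1)}


def poisson_table(lam, cutoff=100):
    # Truncated at `cutoff`; the omitted tail is negligible for small lam.
    return {x: lam**x * exp(-lam) / factorial(x) for x in range(cutoff)}


def uniform_table(mass_points):
    return {x: 1 / len(mass_points) for x in mass_points}


if __name__ == "__main__":
    print(moments(binomial_table(10, 0.3)))     # compare with Np, Np(1-p)
    print(moments(poisson_table(4.0)))          # compare with lambda, lambda
    print(moments(uniform_table(range(1, 7))))  # mean and variance of a fair die
```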
Lecture 5-3. Probability and the Linear Model


Probability and the Linear Model: Probabilistic Assumptions Regarding the
Errors of the Linear Model

(1)

Lecture Content:

1. Discuss probability distribution of the error terms in a linear
model

2. Introduce several continuous probability distributions important
to the analysis of a linear model

Main Topics:

1. The Mathematical Form of the Linear Model

2. Some Additional Distribution Theory

Topic 1. Mathematical Form of the Linear Model

I. Basic Issue: Review our notions of regression and introduce
probability into our analysis

1. Unit 4 presented multiple regression as model fitting

2. No formalized goodness-of-fit measures were discussed, since
such analyses depend on probability theory that, at the time,
we had not discussed

3. In this lecture, and in Unit 6, we reintroduce the linear
model, presenting the relevant probabilistic assumptions,
and discuss how to evaluate a fitted model with probabilistic
assessments


II. Problem: What do we assume about the random variation of the
various components in the model?

1. We have a vector of responses $y$, a matrix of vectors of
carriers $X$, a vector of regression coefficients $\beta$, and of
course, residuals $e$.

2. One approach is to assume that $y$ is multivariate Gaussian
distributed, with mean $X\beta$; i.e., essentially work with the
conditional distribution of $y$ given $X$.

a. Regression may be approached strictly via conditional
expectations

b. We always assume that the rows of X are known, fixed
constants

c. Hence it seems logical to say that given X, y is a
random variable

3. A simpler method of introducing probability focuses on
the residuals of the model

a. This is the only approach we will use

b. We assume that each residual $y_i - \hat{y}_i$ is a univariate
Gaussian random variable
III. Solution: The linear model with Gaussian errors

1. First we review the linear model (2)

a. Assume $y_i$ is a linear function of $x_{i1}, x_{i2}, \ldots, x_{ip}$,
$i = 1, 2, \ldots, N$

b. We write

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + e_i$

$\beta_0$ is a constant term in the model; $e_i$ is the ith residual.

c. In matrix form

$y = X\beta + e$ (add a column of 1's to X)

$y$ (N x 1), $X$ (N x (p+1)), $\beta$ ((p+1) x 1), $e$ (N x 1)

d. Assumptions on e

i. $E(e) = 0$: each residual has mean (expected value) 0

ii. Residuals are independent, with constant variance $\sigma^2$

iii. Hence $\mathrm{Cov}(e) = \sigma^2 I$, an (N x N) matrix

iv. Since $y = X\beta + e$, i.e. y is the sum of e and $X\beta$, (3)
y has the following moments:

A. $y = X\beta + e$

B. $E(y) = E(X\beta + e) = X\beta + E(e) = X\beta$

C. $\mathrm{Cov}(y) = \mathrm{Cov}(X\beta + e) = \mathrm{Cov}(e) = \sigma^2 I$

v. The new assumption is $e_i \sim$ Gaussian $(0, \sigma^2)$:
Gaussian residuals

e. This new probabilistic assumption also affects the distribution (4)
of b, the Least Squares Regression Coefficients

i. Remember,

$b = \arg\min_{b^*} (y - Xb^*)'(y - Xb^*)$,

i.e. b minimizes the sum of squares of residuals

ii. By differentiating, we find

$b = (X'X)^{-1}X'y = My$, where $M = (X'X)^{-1}X'$, a linear combination
of y

iii. Hence, since a linear combination of Gaussian random
variables is also Gaussian, b is Gaussian, with

A. $E(b) = E(My) = M\,E(y) = (X'X)^{-1}X'X\beta = \beta$ (unbiased)

B. $\mathrm{Cov}(b) = M\,\mathrm{Cov}(y)\,M' = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$
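
The matrix expressions for b and Cov(b) can be traced through on a small data set. The following is a minimal sketch, not the course's own software: it assumes a single carrier plus the constant column, so that X'X is 2 x 2 and can be inverted by hand, it estimates $\sigma^2$ by the residual sum of squares divided by N - 2, and all helper names are hypothetical.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def least_squares(xs, ys):
    """Return (b, s2, cov_b) for the model y_i = b0 + b1 * x_i + e_i."""
    X = [[1.0, x] for x in xs]              # column of 1's added to X
    y = [[v] for v in ys]
    xtx_inv = inverse_2x2(matmul(transpose(X), X))
    M = matmul(xtx_inv, transpose(X))       # M = (X'X)^{-1} X'
    b = [row[0] for row in matmul(M, y)]    # b = M y
    fitted = [b[0] + b[1] * x for x in xs]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(ys, fitted))
    s2 = rss / (len(xs) - 2)                # two fitted coefficients
    cov_b = [[s2 * v for v in row] for row in xtx_inv]  # s2 * (X'X)^{-1}
    return b, s2, cov_b


if __name__ == "__main__":
    b, s2, cov_b = least_squares([0, 1, 2], [0, 1, 1])
    print(b)      # intercept and slope
    print(s2)     # estimate of sigma^2
    print(cov_b)  # estimated Cov(b)
```

Replacing the unknown $\sigma^2$ by its estimate is what leads to the t distribution discussed in the next topic.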


Topic 2. Some Additional Distribution Theory

I. Basic Issue: Introduce other probability functions important
in regression analysis

1. When fitting a model, we examined t-statistics, $R^2$, and
pairwise correlation coefficients

2. Each of these quantities is a random variable with a
specific distribution function


II. Distributions

1. $t = b_j / \mathrm{s.e.}(b_j)$ follows a t distribution on N-p degrees of (5)
freedom, if $\beta_j = 0$

a. $-\infty < t < \infty$

b. As $N \to \infty$, $f(t) \to$ Gaussian (0,1) (6a)

(6b)

2. $r_{ij}$ = Sample Correlation Coefficient of $x_i$ and $x_j$ (7)

a. If $\rho_{ij} = 0$ (population value is zero),
$r_{ij}$ is approximately Gaussian with mean 0

b. This approximation only holds for large N

3. With $s^2 = \frac{1}{N-p}\sum_i (y_i - \hat{y}_i)^2$, the quantity $(N-p)s^2/\sigma^2$ follows a $\chi^2$ distribution on (N-p) (8)
degrees of freedom. Sums of squares of independent standard Gaussian random
variables are chi-square.

4. If $SS_1 \sim \sigma^2\chi^2_{\nu_1}$ and $SS_2 \sim \sigma^2\chi^2_{\nu_2}$ independently,

then $\frac{SS_1/\nu_1}{SS_2/\nu_2} \sim F_{\nu_1,\nu_2}$: the F distribution, ratios of variances.

5. Since $R^2$ is a ratio of sums of squares,

$\frac{R^2/(p-1)}{(1-R^2)/(N-p)} \sim F_{p-1,\,N-p}$

if no linear relationship exists between Y and X. (9)
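
The last relationship turns a fitted $R^2$ into an F statistic. The sketch below is illustrative only; the function name is hypothetical, and it assumes that p counts the fitted coefficients including the constant term, which is the reading under which the degrees of freedom p-1 and N-p apply.

```python
def f_statistic_from_r2(r_squared, n_obs, n_coef):
    """Return (F, df1, df2) with F = [R^2/(p-1)] / [(1-R^2)/(N-p)].

    n_coef is p, assumed here to include the constant term.
    """
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    if n_coef < 2 or n_obs <= n_coef:
        raise ValueError("need p >= 2 and N > p")
    df1 = n_coef - 1
    df2 = n_obs - n_coef
    f_value = (r_squared / df1) / ((1 - r_squared) / df2)
    return f_value, df1, df2


if __name__ == "__main__":
    # Hypothetical fit: R^2 = 0.5 from N = 22 observations and p = 3 coefficients.
    print(f_statistic_from_r2(0.5, 22, 3))
```

The resulting value would then be compared with a tabled percentage point of the F distribution on (df1, df2) degrees of freedom.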
Homework Problems
Unit 5


1. Let N represent a non-response to a mailed questionnaire and R
represent a response. We mail questionnaires to four people on a specific
day.

a) How many elements are in the sample space if we are interested
in the number of responses to the questionnaires mailed on the
given day? List them. Are these outcomes equally likely?

b) How many elements are in the sample space if we are interested
in the response to each questionnaire mailed on the given day?
(Each questionnaire is distinctly identified by a code number.)
List them. Depict the set of outcomes representing responses to
three out of the four questionnaires. Depict the set of outcomes
representing at most one response.

c) What is the probability that the non-responses represent
questionnaires lost in the mail?


2. An econometric model predicts whether the GNP will increase (i),
decrease (d), or remain the same (s) in the following year. The GNP
will then be observed to increase (I), decrease (D), or remain the
same (S).

a) List all the elements of the sample space.

b) Depict the events that the model predicts correctly.

c) Depict the events that the model predicts correctly using a
Venn Diagram (Hint: first consider the possible outcomes (part a)
as a matrix.)


3. According to accident reports, 25% of all accidents which occurred
while equipment was being used were caused by faulty equipment and
75% by improper use of the equipment. The probability that on a
given day an accident will occur while equipment is being used is
.05. Use set notation to define the following events as unions,
intersections, complements, etc. Then calculate the probability that
each event will occur on a given day.

a) An accident occurs caused by faulty equipment.

b) An accident occurs caused by improper use of equipment.

c) No accident occurs.

d) An accident was caused by faulty equipment, given that an
accident has occurred.

e) An accident occurs caused by either faulty equipment or improper
use.

f) An accident occurs, but the equipment was found not to be faulty.


4. A jail has 490 inmates. It is known from the records that

300 committed armed robbery
200 committed larceny
50 committed homicide

20 committed armed robbery and homicide
30 committed larceny and homicide
20 committed larceny and armed robbery
10 committed all three crimes

a) Draw the Venn Diagram illustrating this problem.

If we draw an inmate's file at random, what is the probability that
the inmate committed:

a) Two, but only two, of the three types of crimes

b) At least one of the three types of crimes

c) Homicide, given the inmate committed armed robbery

d) Homicide, given the inmate did not commit armed robbery

e) Armed robbery or larceny

f) Only one of the three types of crimes

g) Arson
5. Age at Marriage, Husband and Wife, New Haven, Conn.


                                   Age of Wife
Age of
Husband        15-19   20-24   25-29   30-34   35-39   40-44   45-49   50+

15-19             42      10       3
20-24            153     504      51      10       1
25-29             52     271     184      22       7       2
30-34              5      52      87      69      13       5
35-39              1      12      27      29      21       2       3
40-44                      1       9      18      17       8       2     1
45-49              1               3       6      16      16       7     1
50 and over                        1       4      11      15      21    43

Total            254     850     365     158      86      48      33    45


Source: A. B. Hollingshead, "Cultural Factors in the Selection of
Marriage Mates," American Sociological Review 15, 1950, p.
a) Are the ages of husband and wife independent? What striking
fact about these data immediately answers this question?

b) What is the probability that one partner was between the ages
of 30-34?

c) What is the probability that one partner was between the ages
of 30-34 and the other was between the ages of 20-24?

d) What is the probability that the husband was between the ages
of 30-34 and the wife was between the ages of 20-24?

e) What is the probability that the wife was between the ages of
20-24 given that the husband was between the ages of 30-34?

f) What is the probability that both partners were at least 45
years old?

g) What is the probability that at least one partner was at least
45 years old?

h) What is the probability that neither partner was at least 45
years old?
6. You are working on your annual report for the mayor of a small city.
The fire department reports that last year they responded to the
following number of false alarms per week:


#  weeks 

#  false  alarms 

1 

0 

4 

1 

7 

z 

10 

3 

15 

4 

8 

5 

7 

3 

7 

i 

8 

9 

52 

a) Use a stem and leaf display to identify the distribution which
the data follow.

b) What is the average number of false alarms per week?

c) What is the standard deviation of the number of false alarms per
week?

d) Use the answers to (b) and (c) in the probability function you
specified in (a) to verify your choice of distribution (just
calculate two of the theoretical numbers of occurrences, i.e.
P{X = 0}, P{X = 1}).

e) If each false alarm costs the city $1600, what should you
budget for false alarms for the first three months of next year?
7. As chairperson of a public service organization's trust fund
you are considering buying one of two stocks, both currently priced
at $46 per share, for a one month trading venture. You have
estimated the probability distribution for the closing prices of the
two stocks (rounded to the nearest dollar) one month hence as follows:


       Stock A                      Stock B
Closing price   f(price)     Closing price   f(price)

     44                           44           .005
     45                           45           .015
     46                           46           .030
     47                           47           .100
     48                           48           .350
     49                           49           .350
     50                           50           .100
     51                           51           .030
     52                           52           .015
     53                           53           .005


a) Find the expected value of one share of each stock.

b) Find the variance of the price of one share of each stock
(Financial analysts often refer to variance as "risk").

c) Which stock would you purchase and why?
8. a) As a city manager, you stop off to visit the 7th Precinct
Police station to look at the crime statistics (crimes per
day) of the past 3 months. Assuming a "typical" period
(not a crime wave), and discounting the "full moon" theory, to
what distribution do you expect the data to conform?

b) Later you enter the comptroller's office to pick up some
financial data. After fitting a regression to these data,
you plot the residuals and determine that the regression fits
remarkably well. To what distribution do the residuals conform?

c) Stopping for lunch at a hamburger joint (since city managers can't
afford real food) you pass the time waiting in line by noting
how long each customer takes to be served. The statistician
in you immediately recognizes that these data fit a distribution,
which you rush off to plot. What distribution caused
you to miss lunch?


9. a) When you return to your office you resume work on the
financial data. Looking at the daily expense account sheets
for the past year, you can't help wondering about the distribution
of the last digit (the "unit" digit denoting single
dollars), so you make a plot. What distribution do you
expect these data to follow?

b) Finally getting to work, you correct the errors in the
financial reports. What distribution do you expect the number
of errors per report to follow?

c) Late in the afternoon you visit a new housing project. The
construction supervisor tells you that he has found 9 faulty
valves in the lot of 96 acquired for the site. Since you
expect to need another 200 lots (of 96 each) over the next 6
months, you need a rough estimate of how many additional
valves to order to replace the faulty ones. What distribution
do you expect the number of faulty valves to follow?

d) Finally, after a hard day running around and using your profound
quantitative skills, you retire to your favorite nightspot.
However, your acute mind does not fail to notice the number of
mugs of draught beer ordered by the customers. You repeat this
exercise every night for a month, except for Sundays, when you
stay home to view "Masterpiece Theater" and "Evening at Symphony"
with the BSO, to catch up on your cultural events. What distribution
do you expect the beer consumption data to follow?
10. The following are the earnings for the city hall staff for the
week of February 4, 1977:


Earnings for week              # employees with
ended February 4, 1977         given earnings

187.50 to 194.99                     2
195.00 to 202.49                     7
202.50 to 209.99                     9
210.00 to 217.49                    14
217.50 to 224.99                    10
225.00 to 232.49                     6
232.50 to 240.00                     2


a) Is the underlying distribution (of weekly earnings) discrete
or continuous? Why?

b) Compute the mean of the above distribution.

c) Is the answer in (b) the same as you would have obtained had
you calculated the ratio:

total earnings for all city hall staff during the week ended 4 Feb. 77
/ total number of city hall staff during the week ended 4 Feb. 77

Why or why not?

d) Compute the median of the above distribution.

e) In which direction are the data skewed?

f) The city comptroller stated that the total payroll for the
week ended 4 Feb. 77 was $10,675.18. Do you have any reason
to doubt this statement? Support your position very briefly.

g) Is the mean computed in (b) a satisfactory description
of the "average" or typical earnings of these 50 employees
in the week of 4 Feb. 77? Why or why not?
Homework Unit 5
Solutions


1(a) There are 5 possible outcomes {0, 1, 2, 3, 4}. They are (probably)
not equally likely, but we do not know for certain.

(b) There are $2^4 = 16$ outcomes (4 questionnaires each with 2 possible
outcomes; N = no response, R = response):

{NNNN, NNNR, NNRN, NNRR, NRNN, NRNR, NRRN, NRRR, RNNN,
RNNR, RNRN, RNRR, RRNN, RRNR, RRRN, RRRR}

{3 responses out of 4} = {RRRN, RRNR, RNRR, NRRR}

{at most one response} = {NNNN, NNNR, NNRN, NRNN, RNNN}

(c) Without data from a carefully planned and correctly implemented
experiment, this question cannot be objectively answered. Remember
that there are many reasons for a non-response.
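
The sample space in 1(b) and the two events can also be listed mechanically. This is a small illustrative sketch, not part of the original solutions; the variable names are hypothetical.

```python
from itertools import product

# Each of the 4 questionnaires is either N (no response) or R (response).
outcomes = ["".join(letters) for letters in product("NR", repeat=4)]

# Events are subsets of the sample space, selected by counting the R's.
three_responses = [o for o in outcomes if o.count("R") == 3]
at_most_one_response = [o for o in outcomes if o.count("R") <= 1]

# For part (a), only the number of responses matters.
response_counts = sorted({o.count("R") for o in outcomes})

if __name__ == "__main__":
    print(len(outcomes), outcomes)
    print(three_responses)
    print(at_most_one_response)
    print(response_counts)
```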

2(a) {(i,I), (i,S), (i,D), (s,I), (s,S), (s,D), (d,I), (d,S), (d,D)}
(b) {model predicts accurately} = {(i,I), (s,S), (d,D)}
(c)

          d         s         i

I       (d,I)     (s,I)    [(i,I)]

S       (d,S)    [(s,S)]    (i,S)

D      [(d,D)]    (s,D)     (i,D)

The shaded area (bracketed cells) represents the
event {model predicts accurately}.

The Venn diagram in this
instance includes
a grid.


3a) P(accident caused by faulty equipment)
= .25 (.05) = .0125

b) P(accident caused by improper use)
= .75 (.05) = .0375

c) P(no accident) = 1 - P(accident) = .95

d) P(faulty equipment | accident) = .25

e) $P((\text{accident} \cap \text{faulty equipment}) \cup (\text{accident} \cap \text{improper use}))$
= P(accident) = .05

f) P(not faulty equipment | accident) = P(improper use | accident) = .75


4.

[Venn diagram of A, L, and H]

where A = armed robbery
L = larceny
H = homicide

a) $p(\text{2 of the 3}) = p(L \cap A \cap H^c \text{ or } L \cap H \cap A^c \text{ or } A \cap H \cap L^c)$

$= p(L \cap A \cap H^c) + p(L \cap H \cap A^c) + p(A \cap H \cap L^c)$, since disjoint.

$= \frac{10}{490} + \frac{20}{490} + \frac{10}{490} = \frac{40}{490} = .08$


b) p(at least one of the 3 types) = 1, since everyone in the jail
committed at least one of the three
crimes.


c) $p(H \mid A) = \frac{p(H \cap A)}{p(A)} = \frac{20/490}{300/490} = \frac{20}{300} = .067$


d) $p(H \mid A^c) = \frac{p(H \cap A^c)}{p(A^c)} = \frac{30/490}{190/490} = \frac{30}{190} = .16$

e) $p(A \cup L) = p(A) + p(L) - p(A \cap L) = \frac{300 + 200 - 20}{490} = \frac{480}{490} = .98$
f) p(only one of the three types) = p(A only) + p(L only) + p(H only)

$= \frac{270}{490} + \frac{160}{490} + \frac{10}{490} = \frac{440}{490} = .90$


g) p(arson) = 0; no individual was assumed to have committed arson.
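
The region counts behind problem 4 follow from inclusion-exclusion applied to the counts stated in the problem, reading "20 committed armed robbery and homicide" and the like as including the 10 inmates who committed all three crimes. The sketch below is illustrative only and its variable names are hypothetical.

```python
TOTAL = 490
A, L, H = 300, 200, 50      # armed robbery, larceny, homicide
AH, LH, AL = 20, 30, 20     # pairwise counts, each including the triple overlap
ALH = 10                    # all three crimes

# Inmates in exactly two of the three sets.
exactly_two = {"AL": AL - ALH, "AH": AH - ALH, "LH": LH - ALH}

# Inmates in exactly one set: remove both pairwise overlaps, then add back
# the triple overlap that was removed twice.
only_one = {
    "A": A - AL - AH + ALH,
    "L": L - AL - LH + ALH,
    "H": H - AH - LH + ALH,
}

at_least_one = sum(only_one.values()) + sum(exactly_two.values()) + ALH

if __name__ == "__main__":
    print(exactly_two, sum(exactly_two.values()) / TOTAL)
    print(only_one, sum(only_one.values()) / TOTAL)
    print(at_least_one / TOTAL)          # everyone committed at least one crime
    print(AH / A, (H - AH) / (TOTAL - A))  # P(H | A) and P(H | not A)
```

The regions add up to all 490 inmates, which is a useful check that the stated counts are mutually consistent.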


5. a) It is clear that the ages of husband and wife are not independent,
since the counts cluster along the diagonal from upper left to
lower right, with many empty cells in the other corners.

b) P(one partner between 30-34) = P(wife between 30-34) +
P(husband between 30-34) - P(both between 30-34)
= (158 + 231 - 69)/1839 = 320/1839 = .174

c) $P(\text{husband between 30-34} \cap \text{wife between 20-24})$ +
$P(\text{wife between 30-34} \cap \text{husband between 20-24})$
= 52/1839 + 10/1839 = 62/1839 = .034

d) $P(\text{husband between 30-34} \cap \text{wife between 20-24})$ = 52/1839 = .028

e) P(wife between 20-24 | husband between 30-34) = 52/231 = .225

f) $P((\text{husband 45-49} \cup \text{50+}) \cap (\text{wife 45-49} \cup \text{50+}))$
= (7 + 1 + 21 + 43)/1839 = 72/1839 = .039

g) $P((\text{wife 45-49} \cup \text{50+}) \cup (\text{husband 45-49} \cup \text{50+}))$ = P(wife 45-49) +
P(wife 50+) + P(husband 45-49) + P(husband 50+) -
P(wife and husband 45-49) - P(wife 50+ and husband 45-49) -
P(wife 45-49 and husband 50+) - P(wife and husband 50+)
= (33 + 45 + 50 + 95 - 7 - 1 - 21 - 43)/1839 = 151/1839 = .082

h) P(neither partner was 45 or older) = 1 - P(at least one partner
was 45 or older) = 1 - .082 = .918
6. (a) The data are Poisson, $\lambda = 4$

(b) 4 false alarms per week ($\lambda$)

(c) Variance of a Poisson = $\lambda$; standard deviation is therefore
$\sqrt{\lambda}$, or 2

(e) 3 months x 4-1/3 weeks/month = 13 weeks x 4 false alarms/week
= 52 expected false alarms
52 x 1600 = $83,200 is the minimum which should be budgeted.
To insure that the department does not run short, more should
be budgeted (probably enough for another 2 false alarms (one
standard deviation) per week)
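
The check asked for in part (d) of the problem, and the budget arithmetic, can be sketched as follows. This is an illustration rather than the original solution: it takes $\lambda = 4$ from the answers above, 52 weeks in the year, a 13-week quarter, and $1600 per false alarm; the function names are hypothetical.

```python
from math import exp, factorial, sqrt

LAMBDA = 4           # mean false alarms per week, from parts (a) and (b)
WEEKS_IN_YEAR = 52
COST_PER_ALARM = 1600


def poisson_pmf(x, lam=LAMBDA):
    return lam**x * exp(-lam) / factorial(x)


def expected_weeks(x, lam=LAMBDA, n_weeks=WEEKS_IN_YEAR):
    """Theoretical number of weeks in the year with exactly x false alarms."""
    return n_weeks * poisson_pmf(x, lam)


def budget(extra_sd=0, lam=LAMBDA, n_weeks=13, cost=COST_PER_ALARM):
    """Quarterly budget, padded by extra_sd standard deviations per week."""
    per_week = lam + extra_sd * sqrt(lam)
    return n_weeks * per_week * cost


if __name__ == "__main__":
    # Compare with the first rows of the table: 1 week with 0 alarms, 4 weeks with 1.
    print(round(expected_weeks(0), 2), round(expected_weeks(1), 2))
    print(budget(0))   # minimum: expected alarms only
    print(budget(1))   # one standard deviation of padding per week
```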

7. (a) E(A) = E(B) = $48.50

(b) Var(A) = 8.25
Var(B) = 1.57

(c) Assuming other factors equal, since the "risk" of B is less
than that of A with the same expected value, B is preferred.
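
These moments can be reproduced from the definitions of the mean and variance of a discrete random variable. In the sketch below the Stock B probabilities are the ones tabulated in the problem. Stock A's probabilities are not legible in this copy, so the code assumes, purely for illustration, equal probability on each of the ten prices; that is one distribution that reproduces the stated E(A) = $48.50 and Var(A) = 8.25, and others may exist.

```python
PRICES = list(range(44, 54))  # closing prices 44 through 53

# Stock B probabilities as tabulated in the problem.
STOCK_B = dict(zip(PRICES, [.005, .015, .030, .100, .350,
                            .350, .100, .030, .015, .005]))

# ASSUMPTION: Stock A taken as uniform over the ten prices (not from the source).
STOCK_A = {price: 0.1 for price in PRICES}


def mean_and_variance(pmf):
    """Expected value and variance of a discrete distribution."""
    mu = sum(prob * x for x, prob in pmf.items())
    var = sum(prob * (x - mu) ** 2 for x, prob in pmf.items())
    return mu, var


if __name__ == "__main__":
    print("A:", mean_and_variance(STOCK_A))
    print("B:", mean_and_variance(STOCK_B))
```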

8. (a) Poisson (or possibly uniform)

(b) Gaussian

(c) exponential

9. (a) uniform

(b) Poisson

(c) binomial

(d) Gaussian (or possibly rectangular, maybe even Poisson)

10. (a) While the underlying distribution is technically discrete
(since fractional cents are not permitted) the measurement unit
$.01 is so small that we usually consider such distributions
to be essentially continuous.
(b) $\bar{X}$ = $213.60

(c) No, the ratio gives the exact mean while part (b) gives only
a close approximation, since a frequency table was used.

(d) Median class is 210.00 to 217.49. Median observation is
the (50 + 1)/2 = 25.5th observation. Median = $210 + $7.50/2
= $213.75 (interpolation)

(e) By comparing the mean and median, we note that the data are
slightly skewed to the left, but for practical purposes they are
symmetrical.

(f) No, since 50($213.60) = $10,680. This figure is an estimate of
the total payroll which is close to the figure of $10,675.18

(g) Yes, the mean here seems typical since there is little skewness
to distort it.
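
The grouped-data mean and interpolated median for problem 10 can be computed as below. This is a sketch rather than the original working: it places every observation at the midpoint of its class for the mean, and for the median it interpolates to the N/2 = 25th observation within the median class, which is the convention that reproduces the $213.75 figure above. The names are hypothetical.

```python
CLASS_WIDTH = 7.50
# (lower class boundary, number of employees), from the earnings table
CLASSES = [(187.50, 2), (195.00, 7), (202.50, 9), (210.00, 14),
           (217.50, 10), (225.00, 6), (232.50, 2)]


def grouped_mean(classes, width):
    """Mean with each observation placed at its class midpoint."""
    n = sum(count for _, count in classes)
    total = sum(count * (lower + width / 2) for lower, count in classes)
    return total / n


def grouped_median(classes, width):
    """Linear interpolation inside the class containing the N/2-th observation."""
    n = sum(count for _, count in classes)
    target = n / 2
    below = 0
    for lower, count in classes:
        if below + count >= target:
            return lower + width * (target - below) / count
        below += count
    raise ValueError("empty frequency table")


if __name__ == "__main__":
    mean = grouped_mean(CLASSES, CLASS_WIDTH)
    print(mean, 50 * mean)                       # mean and estimated total payroll
    print(grouped_median(CLASSES, CLASS_WIDTH))  # interpolated median
```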
Quiz
Unit 5


Name


Point values are given in parentheses, preceding every question. You
have sixty (60) minutes to complete this quiz.

Please write all your answers on these pages, in the space provided.
Answers should be brief and succinct, and clearly expressed.

When appropriate, answers may be left in fractional form, e.g., 693/721.


(20) 1. You have constructed a univariate linear regression model
relating the response variable, miles per gallon for 1977
Volkswagen Rabbits, and the carrier variable, tire pressure
in pounds per square inch. You have only 15 paired observations to
estimate the parameters a and b in the following model:

$y_i = a + b x_i$; $i = 1, 2, \ldots, 15$.

The least squares estimate of b is -0.16, very nearly zero.

a) What distribution does the quantity

$\frac{-0.16}{s\left(\sum_i (x_i - \bar{x})^2\right)^{-1/2}}$

follow, where $s^2 = \frac{1}{13}\sum_i (y_i - \hat{y}_i)^2$? Sketch the shape of
its probability function.

b) Is it correct to assume that this quantity is Gaussian?

c) When can this assumption accurately be made?
(10)  2.  As an employee of the Pennsylvania Department of [illegible],
you are concerned with structural faults in steel I-beams
used to construct bridges in the Pittsburgh metropolitan
area. You inspect a shipment of I-beams from the International
Steel Company. You are told that the probability of a fault
in any given I-beam is 0.0005. What distribution do you
expect the number of defective beams to follow? If there
are N = 4000 I-beams in the shipment, what are the mean and
variance of X = number of defective I-beams in the shipment?


(10)  3.  You are interested in the traffic flow off the 6th, 7th and
9th Street Bridges into the North Side of Pittsburgh. On a
specific Friday afternoon between 4 and 6 p.m., you record
the time in seconds between cars as the cars drive off the
9th Street Bridge into the North Side.

You compute the average waiting time between automobiles to
be 10 seconds.

Sketch the most likely probability function for the waiting
times. With what random variable is this function associated?


(45)  4.  In studying the records of Aggravation Airlines, it has been
found that the actual arrival time of the scheduled 5:00 p.m.
flight from Philadelphia to Pittsburgh, due in at 6:00 p.m.,
is a uniformly distributed random variable in the range of
6:00 p.m. to 7:30 p.m. Let X = 1 represent 6:00 p.m., X = 2
represent 6:01 p.m., etc.

a)  Write out the mathematical expression for f(X).

b)  What is the probability that the plane will be late?


c)  What is the probability that it will be more than 1 hour
late?


( 5)  5.  The number of potholes along a 100 yard stretch of the Parkway
East is a Poisson random variable with a mean of 40. What
is the probability that along a specific 100 yard length
there are no potholes? Leave answer in terms of a power and
multiple of e.


(20)  6.  Consider the following data, giving the number of people arrested,
by race and age, in 1976 in a small town in Ohio:

                  White                     Black
Age        Arrested  Population      Arrested  Population
15-24         378      27,000            65       5,000
25-44         324      36,000            32       4,000
45-74         108      27,000             3       1,000
Totals        810      90,000           100      10,000

Assuming that an individual is drawn at random, find the
probability of the following events:

a) A person is arrested.

b) A person who is arrested is white.

c) A black person is between the ages of 25-74.

d) A person is arrested, given that he or she is 25 or over.

e) A person who is arrested is black and aged 45-74.

f) A person who is arrested is aged 25-44.
(10)  7.  You have a large population of individuals and have recorded
the 1976 Federal Income Tax paid by each individual. You
break the population into 200 separate batches, and compute
the mean and variance of the tax payments for each batch.
Theoretically, what probability distribution should the
variance of the tax payments follow?


(10)  8.  In the article "The Use of Subjective Probability Methods
in Estimating Demand," by Hanns Schwarz, what is subjective
probability, and how does it differ from probability
defined as relative frequencies? How does Schwarz use
subjective probability to get reasonable estimates of
demand?
Quiz, Unit 5
Solutions


1.  a)  t distribution, on 13 degrees of freedom.

b)  No.

c)  Never can this assumption be exact. However, when
n > 30, approximating t by the standard Gaussian
distribution is generally acceptable.


2.  Binomial, N = 4000, p = .0005

$\mu = Np = 4000(.0005) = 2$

$\sigma^2 = Np(1-p) = 2(.9995) = 1.9990$


3.  Exponential

$1/\theta = 10$


4.  a)  f(X) = 1/91; X = 1, 2, ..., 91.

b)  90/91

c)  30/91


5.  $e^{-40}$


6.  a)  910/100,000 = 91/10,000

b)  810/910 = 81/91

c)  5000/10,000 = 1/2

d)  467/68,000

e)  3/910

f)  356/910


7.  Multiple of a $\chi^2$ random variable


8.  Subjective probability is derived from personal opinions about
events that have occurred or will occur, rather than strictly
from direct observations of past events.


Schwarz chooses arbitrary weights for opinions about demand (will
buy, may buy, won't buy) according to such factors as purchaser
and expected date of purchase.
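
The numerical quiz answers above can be checked with a short script. The following is only a sketch using exact fractions; the variable names are illustrative and not part of the course material, and the data are copied from questions 2, 4, 5 and 6.

```python
from fractions import Fraction
import math

# Question 2: binomial count of defective I-beams.
N, p = 4000, Fraction(5, 10000)
binom_mean = N * p                 # Np
binom_var = N * p * (1 - p)        # Np(1 - p)

# Question 4: discrete uniform arrival time,
# X = 1 (6:00 p.m.), X = 2 (6:01 p.m.), ..., X = 91 (7:30 p.m.).
support = range(1, 92)
f = Fraction(1, len(support))
p_late = sum(f for x in support if x > 1)         # later than 6:00 p.m.
p_over_hour = sum(f for x in support if x > 61)   # later than 7:00 p.m.

# Question 5: Poisson with mean 40, probability of zero potholes.
p_no_potholes = math.exp(-40)

# Question 6: arrests and population by race and age group.
arrested = {"white": {"15-24": 378, "25-44": 324, "45-74": 108},
            "black": {"15-24": 65, "25-44": 32, "45-74": 3}}
population = {"white": {"15-24": 27000, "25-44": 36000, "45-74": 27000},
              "black": {"15-24": 5000, "25-44": 4000, "45-74": 1000}}
older = ("25-44", "45-74")

total_arrested = sum(sum(g.values()) for g in arrested.values())
total_population = sum(sum(g.values()) for g in population.values())

p_a = Fraction(total_arrested, total_population)
p_b = Fraction(sum(arrested["white"].values()), total_arrested)
p_c = Fraction(sum(population["black"][a] for a in older),
               sum(population["black"].values()))
p_d = Fraction(sum(arrested[r][a] for r in arrested for a in older),
               sum(population[r][a] for r in population for a in older))
p_e = Fraction(arrested["black"]["45-74"], total_arrested)
p_f = Fraction(sum(arrested[r]["25-44"] for r in arrested), total_arrested)

for name, value in [("a", p_a), ("b", p_b), ("c", p_c),
                    ("d", p_d), ("e", p_e), ("f", p_f)]:
    print(name, value)
```

Exact fractions are used because the quiz allows answers to be left in fractional form.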
Reading Assignments
Unit 6


Assignment

Mosteller, Rourke, and Thomas,
Sections 9-1 - 9-4

Mueller, Schuessler and Costner
Chapter 13

Mosteller, Rourke, and Thomas,
Sections 10-1 - 10-4

Mueller, Schuessler and Costner
Chapter 14

Mosteller, Rourke, and Thomas,
Chapter 12

In addition please read the following articles:
Tanur, pages 220-8
Tufte, pages 285-351, 391-406


Texts:

Mosteller, F., et al., Probability with Statistical Applications,
Second Edition. Reading, Massachusetts: Addison-Wesley, 1970.

Mueller, J. H., et al., Statistical Reasoning in Sociology,
Third Edition. Boston: Houghton Mifflin, 1977.

Tanur, J. M., et al., editors, Statistics: A Guide to the Unknown.
San Francisco: Holden Day, 1972.

Tufte, E. R., editor, The Quantitative Analysis of Social Problems.
Reading, Massachusetts: Addison-Wesley, 1970.


Lecture

Lecture 6-0
Lecture 6-1
Lecture 6-2


Lecture 6-0.  Introduction to Unit 6
Introduction to Unit 6.  Statistical Inference
Lecture Content:

1.  Introduction to the objectives, problems, and notation of Unit 6


Main Topics:

1.  Specific Introduction to the Objectives of Unit 6

2.  Presentation of General Problem of Unit 6

3.  Notation for Unit 6
Topic 1.  Specific Introduction to the Objectives of Unit 6

I.  Questions to be answered in Unit 6

1.  Is it ever possible to study an entire population?

a.  Such a complete study is called a census; every
individual in the population is sampled

b.  Usually, the researcher only has the opportunity and
ability to study a fraction of the population, called
a sample

c.  The findings of the study apply only to the sample; the
sample statistics estimate the true population value

2.  How much can we infer from these findings, in our effort
to study the entire population?

a.  This process of "extending" our analytical results is
called statistical inference

b.  The example illustrates the problem (1)

3.  Can we quantitatively assess how good an estimate is?


II.  Skills to be mastered in Unit 6 (2)

1.  Calculation of probabilities using the Gaussian probability
function

2.  Making estimations about the values of parameters in the
population

3.  Placing intervals around these estimates to give a range
of possible parameter values

4.  Testing relationships concerning the variables in the
population
Topic 2.  Introduction to the Problems of Unit 6


I.  What is statistical inference?


1.  For a specific batch of data, obtained as a sample from
a larger population, we compute various statistics:

a.  $p$ = proportion

b.  $\bar{X}$ = sample mean

c.  $M$ = sample median

d.  $r$ = sample correlation coefficient

2.  These quantities are estimates of true population values,
called parameters

a.  $\bar{X}$ estimates $\mu$

b.  $r$ estimates $\rho$

c.  $p$ estimates $P$

3.  Statistical inference is concerned with how well these
statistics estimate population values

a.  How much "faith" can we have in any given estimate?

b.  Our notion of "faith" will be quantified through the
use of probability, especially probabilities from the
Gaussian distribution

4.  Statistical inference makes the risk associated with the
use of a specific statistic explicit and known

II.  Calculation of Gaussian probabilities


1.  Suppose that X is a Gaussian random variable, with mean $\mu$
and variance $\sigma^2$

a.  Transparency shows f(X), and probabilities associated (3)
with $\pm$ standard deviations from $\mu$

b.  We desire to compute $P\{a \le X \le b\}$ for some a < b.

2.  We standardize X to a standard Gaussian random variable,
and then use tabulated values of this standard distribution

3.  For example: (4)

$P\{a \le X \le b\} = P\left\{ \dfrac{a - \mu}{\sigma} \le Z \le \dfrac{b - \mu}{\sigma} \right\}$

where $Z \sim \mathrm{Gau}(0, 1)$

4.  Transparency shows Tables of Standard Normal Random (5)
Variable

$P\{0 \le X \le a\}$ for a > 0.

5.  Remember:

a.  $P\{X \le a\}$ where a < 0

$= P\{X \ge -a\}$ by symmetry

b.  $P\{X > a\} = 1 - P\{X \le a\}$

c.  $P\{-\infty < X \le a\}$, $a \ge 0$

$= .5 + P\{0 < X \le a\}$
(Work several examples using the Tables)
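
As a supplement to working examples from the printed tables, the standardization step can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example values (mean 100, standard deviation 15) are assumptions chosen for demonstration, not part of the lecture. The standard Gaussian cumulative probability is obtained from the error function in Python's standard library.

```python
import math

def std_normal_cdf(z):
    """P{Z <= z} for a standard Gaussian Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_zero_to(z):
    """Tabulated quantity P{0 <= Z <= z} for z >= 0."""
    return std_normal_cdf(z) - 0.5

def gaussian_prob(a, b, mu, sigma):
    """P{a <= X <= b} for X Gaussian with mean mu and standard deviation sigma."""
    z_a = (a - mu) / sigma   # standardize the lower limit
    z_b = (b - mu) / sigma   # standardize the upper limit
    return std_normal_cdf(z_b) - std_normal_cdf(z_a)

# Hypothetical example: X Gaussian with mean 100 and standard deviation 15.
print(gaussian_prob(85, 115, mu=100, sigma=15))   # within one standard deviation

# The three "Remember" rules, checked numerically at a = 1.2:
a = 1.2
print(std_normal_cdf(-a), 1 - std_normal_cdf(a))   # symmetry
print(1 - std_normal_cdf(a))                       # P{Z > a}
print(0.5 + area_zero_to(a), std_normal_cdf(a))    # .5 + table value
```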
Topic 3.  Notation for Unit 6

I.  Conventions

1.  Population values denoted by Greek letters
$\mu$, $\sigma^2$, $\rho$, $\beta$


2.  Sample estimates denoted by Latin letters
$\bar{X}$, $s^2$, $r$, $b$


II.  Variables

1.  X, Y denote variables; x, y are realizations
Lecture 6-0
Transparency Presentation Guide


Lecture Outline        Transparency   Transparency
Location               Number         Description

Topic 1
  Section I, 2.b       1              Inference Problem
  Section II           2              Skills to be Mastered

Topic 2
  Section II, 1.a      3              Gaussian Probability Distribution
  Section II, 3.       4              Calculating Gaussian Probabilities
  Section II, 4.       5              Normal Curve Areas
Area under the standard normal curve from 0 to z, shown shaded, is A(z).

Examples. If Z is the standard normal random variable and z = 1.54,
then

A(z) = P(0 < Z < z) = .4382,
P(Z > z) = .0618,
P(Z < z) = .9382,
P(|Z| < z) = .8764.


 z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

0.0   .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
0.1   .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0753
0.2   .0793  .0832  .0871  .0910  .0948  .0987  .1026  .1064  .1103  .1141
0.3   .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
0.4   .1554  .1591  .1628  .1664  .1700  .1736  .1772  .1808  .1844  .1879

0.5   .1915  .1950  .1985  .2019  .2054  .2088  .2123  .2157  .2190  .2224
0.6   .2257  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2517  .2549
0.7   .2580  .2611  .2642  .2673  .2704  .2734  .2764  .2794  .2823  .2852
0.8   .2881  .2910  .2939  .2967  .2995  .3023  .3051  .3078  .3106  .3133
0.9   .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389

1.0   .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
1.1   .3643  .3665  .3686  .3708  .3729  .3749  .3770  .3790  .3810  .3830
1.2   .3849  .3869  .3888  .3907  .3925  .3944  .3962  .3980  .3997  .4015
1.3   .4032  .4049  .4066  .4082  .4099  .4115  .4131  .4147  .4162  .4177
1.4   .4192  .4207  .4222  .4236  .4251  .4265  .4279  .4292  .4306  .4319

1.5   .4332  .4345  .4357  .4370  .4382  .4394  .4406  .4418  .4429  .4441
1.6   .4452  .4463  .4474  .4484  .4495  .4505  .4515  .4525  .4535  .4545
1.7   .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
1.8   .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
1.9   .4713  .4719  .4726  .4732  .4738  .4744  .4750  .4756  .4761  .4767

2.0   .4773  .4778  .4783  .4788  .4793  .4798  .4803  .4808  .4812  .4817
2.1   .4821  .4826  .4830  .4834  .4838  .4842  .4846  .4850  .4854  .4857
2.2   .4861  .4864  .4868  .4871  .4875  .4878  .4881  .4884  .4887  .4890
2.3   .4893  .4896  .4898  .4901  .4904  .4906  .4909  .4911  .4913  .4916
2.4   .4918  .4920  .4922  .4925  .4927  .4929  .4931  .4932  .4934  .4936

2.5   .4938  .4940  .4941  .4943  .4945  .4946  .4948  .4949  .4951  .4952
2.6   .4953  .4955  .4956  .4957  .4959  .4960  .4961  .4962  .4963  .4964
2.7   .4965  .4966  .4967  .4968  .4969  .4970  .4971  .4972  .4973  .4974
2.8   .4974  .4975  .4976  .4977  .4977  .4978  .4979  .4979  .4980  .4981
2.9   .4981  .4982  .4982  .4983  .4984  .4984  .4985  .4985  .4986  .4986

3.0   .4987  .4987  .4987  .4988  .4988  .4989  .4989  .4989  .4990  .4990
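
Entries of a normal curve area table of this kind can be regenerated rather than looked up. The sketch below assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the function names are illustrative only. Entries whose fifth decimal is very close to 5 may differ from a printed table by one unit in the last digit, depending on the rounding convention the printer used.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def area_from_zero(z):
    """A(z) = P(0 < Z < z) for z >= 0."""
    return STD_NORMAL.cdf(z) - 0.5

def table_row(tenths):
    """Ten entries A(z) for z = tenths/10 + .00, .01, ..., .09."""
    return [round(area_from_zero((10 * tenths + h) / 100), 4)
            for h in range(10)]

# The worked example that accompanies the table, z = 1.54.
z = 1.54
print("A(z)       =", round(area_from_zero(z), 4))
print("P(Z > z)   =", round(0.5 - area_from_zero(z), 4))
print("P(Z < z)   =", round(0.5 + area_from_zero(z), 4))
print("P(|Z| < z) =", round(2 * area_from_zero(z), 4))

# One row of the table, for z = 1.00 through 1.09.
print(table_row(10))
```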
Lecture 6-1.  Quantifying Uncertainty of Estimates

Quantifying Uncertainty of Estimates with Confidence Intervals: Interval
Bounds Between which the Population Parameter will fall with a specified
frequency.

Lecture Content:

1.  Sampled data and parameter estimates

2.  Quantifying certainty (or uncertainty) in our estimates

Main Topics:

1.  Notion of Confidence Intervals

2.  Calculating Confidence Intervals

(There are no transparencies for this lecture.)
Topic 1.  Notion of Confidence Intervals

I.  General Problem: Sampled data yield estimates of population
parameters which will not be equal to the parameter. How can
we quantify the certainty (or uncertainty) that we have in our
estimate?

II.  Solution: Confidence intervals, i.e., interval bounds between
which the population parameter will fall with specifiable
probability.

III.  Skills to master: Confidence intervals for mean, regression
coefficient, correlation coefficient, proportion.


IV.  Specific Notions

1.  Inference goes from sample to population. We measure a
feature of the sample data and infer the value of the
population parameter from this. Thus, we call the sample
statistic an estimate of a parameter.

2.  Parameter is constant. But from sample to sample measured
estimate can vary.

3.  Each estimate is a value of a random variable whose
distribution may be known from theory or assumption.

Ex: means of samples are Normal in large samples.

4.  Point estimate: single value. But this may be in error;
in fact, we don't expect it to equal the parameter. Simply
reporting the number gives no indication of how close we
believe the estimate is to the parameter.

5.  Interval estimate: bounds for an interval containing the
point estimate (not necessarily symmetric) which we know
contains the parameter with certain probability.

6.  The probability that the interval covers the parameter is
the confidence level. The interval is called a confidence
interval.

7.  Note that since the parameter falls in the interval with
certain probability < 1, it may not actually be in the
interval. (95% confidence means 1 in 20 chance of being wrong.)
Topic 2.  Calculation of Confidence Intervals


I.  Specific Methods

1.  The normal approximation for $\bar{X}$.

The population mean, $\mu$, is estimated by the sample mean, $\bar{X}$,
an unbiased estimate of $\mu$ since $E(\bar{X}) = \mu$.

2.  What is its "sampling" distribution?

From statistical theory (central limit theorem), regardless
of distribution of original x's, frequency distribution of
$\bar{X}$ in repeated random samples of size n tends to the
Normal as $n \to \infty$.

(Note that the closer to normal the original distribution,
the smaller n can be for using normal approximation. Others
may require n >> 100. The more skewed x, the larger n should be.)

3.  What is its standard deviation? (Often called standard
error because it indicates the amount of error in using $\bar{X}$
as a measure of $\mu$.)

$\sigma_{\bar{X}} = \sigma/\sqrt{n}$

4.  Since $\bar{X}$ is normally distributed (for repeated random
samples and large n) and we know its expected value and
standard deviation, we can construct a standardized normal
deviate:

$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

5.  Now we can specify the probability that $\bar{X}$ lies between
two limiting values by determining the probability that Z
lies between the corresponding standardized values, and this
we do by examining a table of the percentage points of the
cumulative normal.
II.  Confidence Intervals for $\mu$: $\sigma$ known.
(Example: IQ scores)

1.  Random sample of size n, $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$

If Y is drawn from $N(\mu, \sigma)$

$P\{\mu - 1.96\sigma \le Y \le \mu + 1.96\sigma\} = .95$

Thus, if $\bar{X}$ is drawn from $N(\mu, \sigma/\sqrt{n})$

$P\{\mu - 1.96\sigma/\sqrt{n} \le \bar{X} \le \mu + 1.96\sigma/\sqrt{n}\} = .95$

or

$(\bar{X} - 1.96\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\sigma/\sqrt{n})$ is our interval.

Thus the 95% confidence interval for $\mu$ is

$\bar{X} \pm 1.96\sigma/\sqrt{n}$

Since $\pm 2.58$ contains 99% of the standard normal, 99%
conf. limits are

$\bar{X} \pm 2.58\sigma/\sqrt{n}$

In general, p% confidence limits are

$\bar{X} \pm Z_{p/100 \cdot 1/2}\, \sigma/\sqrt{n}$

where Z is a value in the cumulative normal table such
that the area between 0 and Z is $(p/100)(1/2)$.

(One sided tests use Z such that area is p.)

2.  Sample size

We want estimate accurate to $\pm L$; set probability of $\bar{X}$
lying between $\mu \pm L$ = .95, say.

Then

$P\{\mu - L \le \bar{X} \le \mu + L\} = .95$

But $P\{\mu - 1.96\sigma/\sqrt{n} \le \bar{X} \le \mu + 1.96\sigma/\sqrt{n}\} = .95$

Thus $L = 1.96\sigma/\sqrt{n}$
Making $1.96 \approx 2$ we have

$n = 4\sigma^2/L^2$ for 95% prob.
$n = 6.6\sigma^2/L^2$ for 99% prob.


3.  $\sigma$ unknown

a.  Use $s$ as estimate of $\sigma$. It is based on (n-1) degrees
of freedom. Now

$t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$

t differs from N(0,1) only when $n \le 30$.

b.  p% confidence intervals

$\mu = \bar{X} \pm t_{p/100 \cdot 1/2}\, s/\sqrt{n}$, t on n-1 df


4.  Regression coefficient

$\dfrac{b_i - \beta_i}{\sqrt{\mathrm{var}(b_i)}} \sim t$ on n-k df

How is $\mathrm{var}(b)$ computed?

With univariate regression,
through the origin,
use residual variance

$\mathrm{var}(b) = \dfrac{\sum_t (y_t - \hat{y}_t)^2 / (n-2)}{\sum_t (x_t - \bar{x})^2}$

(numerator is the squared standard error of regression;
denominator is (n-1) times the squared standard error of X)
Then c% confidence intervals are

$\beta_i = b_i \pm t_{c/100 \cdot 1/2} \sqrt{\mathrm{var}(b_i)}$


5.  Correlation coefficient

Fisher's r-to-Z transformation

$Z = \frac{1}{2}[\log_e(1+r) - \log_e(1-r)] \sim N$, with standard deviation $1/\sqrt{n-3}$,

where $r = \dfrac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} = \dfrac{\mathrm{Cov}(X,Y)}{s_X s_Y}$

Then p% confidence interval (on the transformed scale):

$\rho = Z \pm Z_{p/100 \cdot 1/2} / \sqrt{n-3}$

6.  Proportion

If r of n have attribute, the proportion P in the population is
estimated by

$p = r/n$. When n > 30, p is $N(P, PQ/n)$; use p and q (= 1-p)

to estimate P, Q and c% confidence interval

$P = p \pm Z_{c/100 \cdot 1/2} \sqrt{pq/n}$

When n is small use

(Careful when p and q are not near .5 and n is < 75)
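
The large-sample interval calculations in this topic can be sketched in Python. This is a minimal sketch under stated assumptions, not the course's own procedure: it uses `statistics.NormalDist` (Python 3.8+) in place of the printed normal table, all function names and example numbers are hypothetical, the sample-size function uses the exact multiplier rather than the rounded values 4 and 6.6, and the correlation interval is back-transformed from Fisher's Z scale to the r scale with `tanh`, a step the outline does not show.

```python
import math
from statistics import NormalDist

def z_value(conf):
    """Two-sided multiplier: area between -z and +z equals conf (e.g. 0.95)."""
    return NormalDist().inv_cdf(0.5 + conf / 2.0)

def mean_ci(xbar, sigma, n, conf=0.95):
    """Interval for the mean when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    half = z_value(conf) * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def sample_size(sigma, L, conf=0.95):
    """Smallest n for which the half-width z * sigma / sqrt(n) is at most L."""
    return math.ceil((z_value(conf) * sigma / L) ** 2)

def proportion_ci(r, n, conf=0.95):
    """Large-sample interval for a proportion: p +/- z * sqrt(p * q / n)."""
    p = r / n
    half = z_value(conf) * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def correlation_ci(r, n, conf=0.95):
    """Interval for a correlation via Fisher's transformation.

    The interval is formed on the Z scale with standard deviation
    1 / sqrt(n - 3) and then mapped back to the r scale with tanh
    (the back-transformation is an addition to the outline).
    """
    z = 0.5 * (math.log(1 + r) - math.log(1 - r))
    half = z_value(conf) / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)

# Hypothetical inputs, for illustration only.
print(mean_ci(xbar=100, sigma=15, n=36))
print(sample_size(sigma=15, L=3))
print(proportion_ci(r=45, n=150))
print(correlation_ci(r=0.5, n=28))
```

When sigma is unknown and n is small, the multiplier should come from the t distribution on n-1 degrees of freedom instead; the Python standard library has no t quantile function, so that case is left to the printed t table.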
Lecture 6-2.  Significance Testing

Significance Testing: Determining the reasonableness of a hypothesis

Lecture Content:

1.  Null hypotheses: $H_0$

2.  Determining whether to reject or not reject $H_0$

3.  Significance Levels

Main Topics:

1.  Significance Tests: Concepts

2.  Significance Tests: Techniques
Topic 1.  Significance Tests: Concepts

I.  Problem: Sometimes a point value for a population parameter is
assumed or hypothesized. How can sample results be used to
test the reasonableness of the hypothesis?

II.  Solution: Significance test. Calculate a test criterion from
the sample data; if it falls into a region of rejection, the
null hypothesis, i.e., the hypothesized value for the parameter,
is rejected and the departure is called statistically significant.
If the null hypothesis is true the test has a known probability
of obtaining a significant result, which is called the significance
level of the test.

III.  Notion of a null hypothesis

A.  Considerations


1.  This is a statistical hypothesis, an assertion that the
population parameter has a certain value. It is called
null because the assertion is that there is no difference
between the hypothetical value and the parameter's actual
value. This is nonetheless hypothetical because we have
no evidence (yet) that the hypothesized and true values
are equal.

2.  This leads to a decision making situation. We want to
construct a procedure with which we contrast the null
hypothesis with evidence drawn from sample data.

3.  If the value computed from the data is "very different"
from the null hypothesis we reject it. If it is "similar"
we do not reject it.

4.  The null hypothesis can describe a single parameter,
such as a regression coefficient, or a difference in
parameters, such as the difference in means.

B.  Notion of a rejection region

1.  How can we specify what is a "very different" value
leading to rejection or a "similar" value leading to
non-rejection?


2.  Use probability. If we know the sampling distribution
of the estimate under the null hypothesis then we can
compute the probability of observing a value like that
computed from the sample data. (Exactly, in fact.)
3.  When this probability is quite small, we can argue
that the observed value is unlikely to be an observation
on a random variable with parameter equal to the null
hypothesized value. Thus the data lead us to reject the
null hypothesis or, we say, they fail to confirm it.

Example:


Y may have arisen from $f(Y/\lambda_0)$ but such a value is
almost a rare event in this distribution. We must decide
whether the rare event occurred, or whether $\lambda_0$ seems
reasonable as the parameter of the distribution.


4.  We can establish levels which bound small areas of
probability such that if the observed value falls beyond
the inner bounds of the levels we agree to reject the
null hypothesis. These are the rejection regions.
Between the limits lies the acceptance region.


5.  If we have no idea on which side of the null hypothesis
the sample value will fall (disjoint alternative hypotheses)
we need two levels. If we have a prior idea (one
alternative hypothesis) we need only one level. In the
first case we have a two tailed test: two rejection
regions in the tails of the assumed distribution. In
the latter case we have a one tailed test.

C.  Notion of power of a test

1.  The estimate of the parameter calculated from the sample
is an observation on a random variable whose sampling
distribution is known when the null hypothesis is true.

2.  Since the probability of observing a rare value is small
but nonetheless positive, sometimes a value will fall in
the rejection region and a true null hypothesis will be
falsely rejected.

3.  The probability of rejecting a true null hypothesis is
the significance level of the test. It is equal to the
area in the rejection regions and is 1 minus the
confidence level.

4.  The probability of rejecting a false null hypothesis ($H_0$)
is the power of the test ($1 - \beta$).


D.  Types of errors in decision making


               Reject $H_0$ (R)         Do not reject $H_0$ (not R)

$H_0$ true       Type I error, $\alpha$     $1 - \alpha$

$H_0$ false      Power, $1 - \beta$         Type II error, $\beta$


To reduce type I error ($\alpha$) increase confidence, i.e., min. $\alpha$

To reduce type II error ($\beta$) increase power, i.e., min. $\beta$

There is a trade off between type I and type II errors; in
general when one decreases, the other increases.
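
The trade off between the two error types can be illustrated numerically. The sketch below assumes a two tailed test of a mean with known sigma; the function name and the example values (null value 100, true value 103, sigma 15, n = 50) are hypothetical and chosen only to show the pattern.

```python
import math
from statistics import NormalDist

def two_tailed_power(mu0, mu_true, sigma, n, alpha):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu_true.

    The test rejects when |Z| exceeds the critical value for level alpha,
    where Z = (xbar - mu0) / (sigma / sqrt(n)).
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Under the true mean, Z is Gaussian with this mean and unit variance.
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))
    upper_tail = 1 - std.cdf(z_crit - shift)
    lower_tail = std.cdf(-z_crit - shift)
    return upper_tail + lower_tail

for alpha in (0.10, 0.05, 0.01):
    power = two_tailed_power(mu0=100, mu_true=103, sigma=15, n=50, alpha=alpha)
    beta = 1 - power   # type II error probability
    print(f"alpha = {alpha:.2f}   power = {power:.3f}   beta = {beta:.3f}")
```

With the sample size held fixed, lowering alpha raises beta, which is the trade off described above. When the true mean equals the null value, the rejection probability returned is alpha itself.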
Topic 2.  Significance Tests: Techniques

I.  General procedure:

1.  Determine test statistic

2.  Establish null hypothesis

3.  Determine sampling distribution of test statistic under
null hypothesis.

4.  Set levels for rejection (or simply report p value),
usually .1, .05, .01. (Discuss looking up critical values
in appropriate table of the distribution)

5.  Perform test

II.  Examples

1.  For $\mu$; $\sigma$ known: $H_0$: $\mu = \mu_0$

compute $Z = \dfrac{\bar{X} - \mu_0}{\sigma_{\bar{X}}} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

Is $|Z| > Z_\alpha$? If yes, reject $H_0$ at $\alpha$.

2.  For $\mu$; $\sigma$ unknown, n > 30: $H_0$: $\mu = \mu_0$

use $s/\sqrt{n}$ to estimate $\sigma/\sqrt{n}$ and proceed as above.

3.  For $\mu$; $\sigma$ unknown, n < 30: $H_0$: $\mu = \mu_0$

$t = \dfrac{\bar{X} - \mu_0}{s_{\bar{X}}} = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$ where t has (n-1) df


4.  Two means from independent samples

Note:

a.  Difference between two normally distributed random
variables is normal

b.  $\sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}$ where $X_1$ & $X_2$ are i.i.d.

Then

$\dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}$ is test statistic

c.  If the $X_i$ are drawn from the same population, $n_1 = n_2 = n$,
with variance $\sigma^2$ known then

$\sigma_{\bar{X}_1 - \bar{X}_2} = \sigma\sqrt{2/n}$
With known $\sigma$ use Z.
With unknown $\sigma$ use

pooled $s^2 = (s_1^2 + s_2^2)/2$; t has 2(n-1) df

d.  $n_1 \ne n_2$ then $\sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2/n_1 + \sigma^2/n_2$

and when $\sigma$ is unknown use

$s^2_{\bar{X}_1 - \bar{X}_2} = s^2/n_1 + s^2/n_2$ where t has $n_1 + n_2 - 2$ d.f.

5.  Correlation coefficient

Use Fisher's transformation

$Z = \frac{1}{2}[\log_e(1+r) - \log_e(1-r)]$

which has standard deviation $1/\sqrt{n-3}$

and proceed as with normally distributed test statistic.

6.  Regression coefficient

$t = \dfrac{b_i}{\sqrt{\mathrm{var}(b_i)}}$

where t has (n-k) d.f. and k is total number of parameters
being estimated.

7.  Regression

$F = \dfrac{R^2/(k-1)}{(1-R^2)/(n-k)} \sim F_{k-1,\, n-k}$

if no linear relationship exists between y and X,

where k is number of parameters estimated
and n is sample size.
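
Several of the test statistics listed in this topic can be computed directly. The following sketch is illustrative only: the function names and the example numbers are hypothetical, critical values for the Z tests come from `statistics.NormalDist` (Python 3.8+) rather than the printed table, and the t and F statistics would still need to be compared with printed t and F tables, since the standard library provides no quantiles for those distributions.

```python
import math
from statistics import NormalDist

def z_statistic(xbar, mu0, sd, n):
    """Z = (xbar - mu0) / (sd / sqrt(n)); sd is sigma, or s when n is large."""
    return (xbar - mu0) / (sd / math.sqrt(n))

def reject_two_tailed(z, alpha):
    """True if |z| exceeds the two tailed critical value at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit

def fisher_z_statistic(r, n, rho0=0.0):
    """Standardized Fisher transform of r for testing H0: rho = rho0."""
    def transform(v):
        return 0.5 * (math.log(1 + v) - math.log(1 - v))
    return (transform(r) - transform(rho0)) * math.sqrt(n - 3)

def f_statistic(r_squared, n, k):
    """F = [R^2 / (k - 1)] / [(1 - R^2) / (n - k)], on (k - 1, n - k) d.f."""
    return (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))

# Hypothetical numbers, for illustration only.
z = z_statistic(xbar=52, mu0=50, sd=10, n=100)
print(z, reject_two_tailed(z, 0.05), reject_two_tailed(z, 0.01))
print(fisher_z_statistic(r=0.5, n=28))
print(f_statistic(r_squared=0.5, n=23, k=3))
```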
1)  Of the 3,017 families in Ellwood City, PA in 1970, a random sample
of 300 families was taken to determine the mean family income. A
95% confidence interval ($8,812 to $9,116) was established on the
basis of the sample.

Using only the above information comment on the truthfulness of
the following statements:

a)  Of all possible samples of size 300 drawn from this population,
95% will have sample means between $8,812 and $9,116.

b)  Of all possible samples of size 300 drawn from this population,
95% will have population means between $8,812 and $9,116.

c)  Of all possible samples of size 300 drawn from this population,
95% of the confidence intervals will contain the true
population mean.

d)  95% of the families in Ellwood City have incomes in the range
$8,812 to $9,116.

2)  One can always decrease the width of a confidence interval by
increasing the sample size. Why then does one not always determine
the desired width and sample accordingly?

3)  Suppose you are interested in the proportion of families in the
United States that have 4 or more children. Let the true population
proportion be P. Since your office does not have a copy of the
current Statistical Abstract, you are instructed to estimate P
based on a very small sample of 1,000 families.

a)  Let p be the estimate of P from your sample. What is the
(large-sample) sampling distribution of p?

b)  Suppose we found a p of [illegible]. Construct a 90% confidence
interval about p based on these sampling results. What do
you report to your supervisor concerning the true population
proportion? In a policy context, when would the point estimate
p be preferred to the 90% confidence interval?

c)  Suppose that n is quite small, and P = .90. Explain why the
sampling distribution of p will be asymmetric, and tell your
supervisor why the Gaussian approximation is inaccurate in this
instance.

4)  Suppose that as an employee of HEW, you are studying the effect of
the apparent decline in intelligence exhibited by high school
upperclassmen on the allocation of federal funds to public schools. You
have SAT scores for high school seniors throughout the country for
1966-1975.

It is a well known fact that an individual's score on the Mathematics
Scholastic Aptitude Test, administered by the College Entrance
Examination Board, is a random variable with mean 500, standard deviation
100; moreover, for all but very specific purposes, it is Gaussian.

a)  What is the probability that an individual's score is greater
than 700?

b)  What is the probability that a score is between 400 and 650?

c)  Two sisters each have scores between 500 and 550. What is the
probability of the simultaneous occurrence of these two seemingly
independent events?

d)  Your supervisor states that the single assumption you used to
calculate the probability in (c) (independence) is not at all
correct. Why?

5)  You are conducting a study for a dean of a highly regarded school
of public policy into the ages of incoming master of public
administration students. Your data consist of 22 students.

a)  You find that $\bar{X} = 24.5$ years, and $S^2 = 2.1$ years$^2$, with n = 22.
Construct a 92% confidence interval and interpret your results.

b)  In what way(s) is your confidence interval similar to a hypothesis
test?

6)  The National Training and Development Service has kindly given you
data on the evaluation of 195 proposals for curriculum development.
Each proposal is submitted to one of 8 need areas.

Your data analysis reveals that page length and the indicator variable
for need area 3 are important determinants of the final score awarded
to a proposal.

A regression model of total summed score on page length, (page
length)$^2$, and need area 3 was constructed. Results are given below:

                                        Indicator
                  Pages     (Pages)^2   Need Area 3   Constant

Coefficient       1.54[?]   -.024       -[?].685      41.884

Standard error    .271      .3058       2.014

$R^2 = 0.19$

n = 195; y = total summed score (max = 90)
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a)  Test  the  hypothesis  that  the  model  relating    y    to  the  4 
carriers  is  hot  additive,  and  hence  that  no  linear  regres- 
sion exists. 

b)  Place  confidence  intervals  about  the  least  squares  coefficients 
for  pages  and  (pages) ^. 


c) 


eommeht  on  the  results  of  (a)  and  (b).  How  can  you  explain 
the  rather  contradictory  finding  of  such  a  small  R^?  Would 
a  stem-arid-leaf  display  of  the  residuals  be  useful? 


7)  In a random sample of 1,000 individuals, 600 were in favor of
capital punishment.  Test the hypothesis (α = .10) that individual
attitudes in the population are equally divided for capital
punishment and against it.

8)  You have access to the grade reports of 9 students in the class.
You find that the sample correlation between undergraduate GPA and
fall term QMPM grade is only 0.15.  Can you conclude that there is
no relationship between these 2 variables?

9)  A linear regression model relates the response, migration from
33 SMSA's with populations greater than 500,000, to 3 carriers:
welfare payments per capita, immigration into the SMSA, and
average annual temperature.


The results:

               Coefficient    t-statistic    Standard error

Constant         -0.1978
Welfare           0.3324         17.94          0.0185
Immigration       0.0046          2.14          0.00215
Temperature       0.0026          1.34          0.00192

[label illegible] = 0.00482

R^2 = 0.9394

Comment on these results by constructing hypothesis tests, with
α = .05.


10)  In a sample of 400 professors, you find that the average annual
salary is $23,200, with a standard deviation of $4,000.  Test
the hypothesis that the population value is $25,000.  Let the
probability of a Type I error be .10.




Module III


Homework Solutions
Unit 6


The only valid statement is (c).  This is precisely what we mean by
a confidence interval.  Remember, we are examining a confidence
interval for the one (only one) population mean.  We calculate the
interval using a random sample.

The cost of taking a larger sample may be uneconomical in terms of
the return on the sample information, or the sample may be limited by
other factors such as physical, time, moral/ethical, etc.
constraints.  We usually predetermine n as the largest sample size
possible within time, cost, availability, etc. constraints.

(a)  A normal distribution may be used to approximate the sampling
distribution of p.  Although the ratio p:q (q being the
percentage of families with fewer than 4 children) is likely
to be considerably less than .5:.5, the sample size is
sufficiently large to counter any resulting skewness.  Note
that although 1,000 families is a small sample of the total
number of families in the country, it is a large sample from
the standpoint of developing sampling distributions.  The
distribution will have an estimated mean ($\mu_p$) and standard
deviation ($\sigma_p$) of $p$ and $\sqrt{pq/n}$ respectively.

(b)  We have been given the sample size (1000) and the sample mean
(.125).  We need to determine the critical value and the limits
of the estimate of the mean within which we can be 90% sure that
the true population mean will fall.  The calculations of these
values are as follows:

$z_{.05} = 1.65$

$\Pr\left(p - z\sqrt{pq/n} \le P \le p + z\sqrt{pq/n}\right) = .90$

$.125 - 1.65\sqrt{.00011} \le P \le .125 + 1.65\sqrt{.00011}$

$.108 \le P \le .142$

We prefer a point estimate of P when we need to make decisions
based on a particular value of P (e.g., when funds must be
budgeted to provide a good to every 4-child family).

(c)  The shape of the sampling distribution depends on sample size:
as the sample size increases, the sampling distribution of the
percentage approaches the normal.
4.  (a)  Let X = the individual's score.

We want to find Pr(X > 700), which is equivalent to 1 - Pr(X ≤ 700).
We can subtract the population mean and divide by the standard
deviation on either side of the inequality.  Since Z is of the
form $Z = (X - \mu)/\sigma$, the probability determination can
readily be made.  The calculations are:

$\Pr(X > 700) = 1 - \Pr(X \le 700)$

$= 1 - \Pr\left(\frac{X - 500}{100} \le \frac{700 - 500}{100}\right)$

$= 1 - \Pr(Z \le 2)$

$= 1 - .9772$

$= .0228$

(b)  By the same logic as above

$\Pr(400 \le X \le 650) = \Pr\left(\frac{400 - 500}{100} \le \frac{X - 500}{100} \le \frac{650 - 500}{100}\right)$

$= \Pr(-1 \le Z \le 1.5)$

$= .3413 + .4332$

$= .7745$

(c)  First we find the probability of occurrence of a score between
500 and 550:

$\Pr(500 \le X \le 550) = \Pr\left(\frac{500 - 500}{100} \le \frac{X - 500}{100} \le \frac{550 - 500}{100}\right)$

$= \Pr(0 \le Z \le .5)$

$= .1915$

If the events are truly independent then the probability of
their simultaneous occurrence is the product of their
probabilities.  Since both events have probabilities of .1915,
the joint probability is $(.1915)^2$ or .0367.

(d)  He is right because the sisters share similar genetic makeup
and environmental experience.  Both factors can influence
intelligence.  The events are therefore not independent and
the probability of both scores being between 500 and 550 is
probably greater than .0367.
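
The standardization steps in Problem 4 can be checked numerically. The following is a minimal sketch, assuming (as the worked solution does) that scores are Normal with mean 500 and standard deviation 100; the helper name `normal_cdf` is illustrative and not part of the course materials.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Pr(X <= x) for a Normal(mean, sd) variable, via the error function."""
    z = (x - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed from the worked solution: mean 500, standard deviation 100.
MEAN, SD = 500.0, 100.0

# (a) Pr(X > 700) = 1 - Pr(X <= 700)
p_above_700 = 1.0 - normal_cdf(700, MEAN, SD)

# (b) Pr(400 <= X <= 650)
p_400_to_650 = normal_cdf(650, MEAN, SD) - normal_cdf(400, MEAN, SD)

# (c) Pr(500 <= X <= 550), and the joint probability for two
#     independent individuals
p_500_to_550 = normal_cdf(550, MEAN, SD) - normal_cdf(500, MEAN, SD)
p_both = p_500_to_550 ** 2

for label, value in [("(a)", p_above_700), ("(b)", p_400_to_650),
                     ("(c) single", p_500_to_550), ("(c) joint", p_both)]:
    print(label, round(value, 4))
```

The joint probability in (c) is only valid under independence, which is exactly the assumption part (d) calls into question.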
The t-distribution is required here since the sample size is
small.  The confidence interval is constructed as follows:

24. 5  ^  2.1  (J^)  1     1  24.5  +  2.1  C 
23.35    1  y  1  24.65 

There is a 92% chance that this interval will contain the
true mean of ages.

To test whether a given population mean is the same as another,
a confidence interval may be established.  This interval
corresponds to the region of "non-rejection" of the null
hypothesis.


The null hypothesis to be tested is $H_0: R^2 = 0$.
The F-distribution is appropriate here.

$F = \frac{R^2}{1 - R^2} \cdot \frac{n - p}{p - 1} = \frac{.19}{.81} \cdot \frac{191}{3} = 14.93$

Compare with the tabled value $F_{3,191;.05}$.
Reject $H_0$.  There is a linear relationship.
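
The overall F statistic above depends only on R^2, the number of observations, and the number of fitted parameters. A minimal sketch follows, assuming p counts the constant along with the three carriers (p = 4), which is what the degrees of freedom 3 and 191 in the solution imply; the function name `overall_f` is hypothetical.

```python
def overall_f(r_squared, n_obs, n_params):
    """F statistic for H0: R^2 = 0.

    n_params counts every fitted coefficient, including the constant,
    so the degrees of freedom are (n_params - 1, n_obs - n_params).
    """
    df_num = n_params - 1
    df_den = n_obs - n_params
    f_stat = (r_squared / df_num) / ((1.0 - r_squared) / df_den)
    return f_stat, df_num, df_den

# Values from the proposal-score regression: R^2 = .19, n = 195, p = 4.
f_stat, df_num, df_den = overall_f(0.19, 195, 4)
print(round(f_stat, 2), df_num, df_den)
```

The critical value itself still has to be read from an F table (or a statistics library); it is deliberately not hard-coded here.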

Pages

$(1.54 - t_{.05}(.271)) \le \beta_1 \le (1.54 + t_{.05}(.271))$

$(1.54 - 1.96(.271)) \le \beta_1 \le (1.54 + 1.96(.271))$

$1.00 \le \beta_1 \le 2.07$

(Pages)^2

$(-.024 - 1.96(.0058)) \le \beta_2 \le (-.024 + 1.96(.0058))$

$-.035 \le \beta_2 \le -.013$

Neither confidence interval contains 0.  Hence both carriers
explain a portion of the total variation.  Also, the additive
model is consistent with the data since N is large, even
though R^2 is small.  An examination of residual plots would
be informative.
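
The two intervals can be reproduced with a few lines of arithmetic. This sketch assumes the large-sample multiplier 1.96 used in the solution; `coef_interval` is an illustrative name.

```python
def coef_interval(estimate, std_error, multiplier=1.96):
    """Interval estimate +/- multiplier * standard error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

# Coefficients and standard errors from the proposal-score regression.
pages_lo, pages_hi = coef_interval(1.54, 0.271)
pages_sq_lo, pages_sq_hi = coef_interval(-0.024, 0.0058)

print("pages:    ", round(pages_lo, 3), round(pages_hi, 3))
print("(pages)^2:", round(pages_sq_lo, 3), round(pages_sq_hi, 3))

# A carrier is judged to matter when its interval excludes zero.
pages_excludes_zero = not (pages_lo <= 0.0 <= pages_hi)
pages_sq_excludes_zero = not (pages_sq_lo <= 0.0 <= pages_sq_hi)
```

The lower limit for pages comes out near 1.009, which the printed solution shows as 1.00.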
7.  Large sample, testing for P

$H_0: P = .5$

$H_1: P \ne .5$

$Z = \frac{.6 - .5}{\sqrt{(.6)(.4)/1000}} = \frac{.1}{.015} = 6.67 \qquad z_{.05} = 1.65$

Since calculated Z exceeds table Z, reject $H_0$.


8.  $H_0: \rho = 0$

$H_1: \rho \ne 0$

$Z = \frac{.15}{.41} = .37 \qquad$ table Z = 1.96 at the .05 level

Since calculated Z < table Z, $H_0$ cannot be rejected.
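
Both large-sample tests in Problems 7 and 8 reduce to an estimate divided by its standard error. The sketch below makes two assumptions that are not stated in the solutions: the proportion test uses the sample proportion in the standard error (as the printed arithmetic does), and the denominator .41 in Problem 8 is taken to be $1/\sqrt{n-3}$ from Fisher's z transformation with n = 9. Function names are illustrative.

```python
import math

def proportion_z(p_hat, p_null, n):
    """Z statistic for H0: P = p_null, with the standard error
    computed from the sample proportion."""
    std_error = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - p_null) / std_error

def correlation_z(r, n):
    """Z statistic for H0: rho = 0 using Fisher's transformation
    (assumed to be the source of the .41 in the solution)."""
    return math.atanh(r) * math.sqrt(n - 3)

z_prop = proportion_z(0.6, 0.5, 1000)   # Problem 7
z_corr = correlation_z(0.15, 9)         # Problem 8

print(round(z_prop, 2), round(z_corr, 2))
```

Carrying full precision gives a Z of about 6.45 for Problem 7; the printed 6.67 comes from rounding the standard error to .015 before dividing. Either way the conclusion is unchanged.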

9.  $t_{.05} = 2.04$.  Since the computed t-statistics for Welfare and
Immigration are both greater than 2.04, they are "significant."
Temperature is not significant at the 5% level.

$F = \frac{R^2}{1 - R^2} \cdot \frac{n - p}{p - 1} = \frac{.9394}{.0606} \cdot \frac{31}{2}$

$= 15.5 \times 15.5 \qquad F_{2,31;.05} = 3.32$

$= 240.25$

There is clearly a linear relationship:

migration = Constant + $\beta_1$ Welfare + $\beta_2$ Immigration

10.  $H_0: \mu = 25,000$

$H_1: \mu \ne 25,000$

$Z = \frac{23,200 - 25,000}{4000/\sqrt{400}} = \frac{-1800}{200} = -9 \qquad z_{.05} = 1.65$

$-9 < -1.65$, therefore reject $H_0$.




Unit  6 
Quiz 


Name: 


Write all your answers on these pages.  Point totals are given in
parentheses prior to each question.  You have forty (40) minutes for
this quiz.

(45)  1.  You have estimated a linear model relating the response single
family housing starts in Pittsburgh (Y) to carriers Median
Price of a new unit in thousands (X1), % Unemployment (X2),
and Population in thousands (X3).  You collect data for
1960-1972.  The model is


$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + e.$


Least  squares  estimates  of  the  parameters  are 
% 

b  - 


and  diagonal  terms  of  the  Variance-Covariance  Matrix  are 

,8 


38  X  10 


8  (X'ft' 


-1 


4.48 X  10- 


2.70  X 10" 


5.04x10 


and R^2 = .902.
(a)  Last term we stated that in testing the importance of a
coefficient in a linear model, you should consider the carrier
"important" if the t-statistic was greater than 2 or 3 in
absolute value.  Why?


(b)  Does the carrier X3 differ from zero?


(c)  What hypothesis do you test to determine whether or not the
response is linearly related to the set of carriers as a
whole?  Under the null hypothesis, the test statistic is
distributed as a specific random variable.  Which distribution
is it, and why is this the correct one?
(25)  2.  Your supervisor states that 5% of the census tracts in
Pittsburgh have median family size greater than 6 individuals per
family.  In disbelief, you gather data on the 86 census tracts
and find that median family size per tract is remarkably well
behaved, with $\bar{x}$ = 4.5 and $s^2$ = .20.  Is your supervisor
correct?  Why or why not?


(30)  3.  The computer center at Robber Baron University claims a
95% availability for their H&L-250 computer.  You are somewhat
skeptical of this statement, so you gather data for
the 30 days that you used the system for your latest paper.
You calculate the average availability to be 85% with
associated standard deviation s of 5%.


(a)  Construct a 95% confidence interval for the true percentage.

(b)  Based on this interval, state and test a hypothesis (α = .05)
to determine the truth of the computer center's assertion.


(c)  Are the distributional assumptions that you made to test
the hypothesis in (b) appropriate?  Why or why not?
TABLE III
The Normal Distribution

$\Pr(X \le x) = N(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-w^2/2}\, dw$

$[N(-x) = 1 - N(x)]$

   x      N(x)       x      N(x)       x      N(x)

  0.00    0.500     1.10    0.864     2.05    0.980
  0.05    0.520     1.15    0.875     2.10    0.982
  0.10    0.540     1.20    0.885     2.15    0.984
  0.15    0.560     1.25    0.894     2.20    0.986
  0.20    0.579     1.282   0.900     2.25    0.988
  0.25    0.599     1.30    0.903     2.30    0.989
  0.30    0.618     1.35    0.911     2.326   0.990
  0.35    0.637     1.40    0.919     2.35    0.991
  0.40    0.655     1.45    0.926     2.40    0.992
  0.45    0.674     1.50    0.933     2.45    0.993
  0.50    0.691     1.55    0.939     2.50    0.994
  0.55    0.709     1.60    0.945     2.55    0.995
  0.60    0.726     1.645   0.950     2.576   0.995
  0.65    0.742     1.65    0.951     2.60    0.995
  0.70    0.758     1.70    0.955     2.65    0.996
  0.75    0.773     1.75    0.960     2.70    0.997
  0.80    0.788     1.80    0.964     2.75    0.997
  0.85    0.802     1.85    0.968     2.80    0.997
  0.90    0.816     1.90    0.971     2.85    0.998
  0.95    0.829     1.95    0.974     2.90    0.998
  1.00    0.841     1.960   0.975     2.95    0.998
  1.05    0.853     2.00    0.977     3.00    0.999

From:  Hogg, R. V. and A. T. Craig, Introduction
to Mathematical Statistics, Third Edition,
New York:  Macmillan, 1970.
Quiz  Unit 6
Solutions


1.  a)  We have been implicitly testing the hypotheses that our β
coefficients are zero.  We know that $(b - \beta)/s_b$ is distributed
as a t random variable with N-p degrees of freedom
(N observations, p variables).  For large N-p and β = 0,
95% of the t's will fall in the interval (-2, 2).  A t-statistic
outside of that interval allows us to reject the hypothesis that
β = 0.  For smaller degrees of freedom, we use the larger interval
(-3, 3).

b)  $H_0: \beta_3 = 0$
$H_1: \beta_3 \ne 0$

$t = \frac{b_3 - 0}{s_{b_3}} = \frac{-19.1 - 0}{\sqrt{50.4}} = -2.7$

Since we have only 13 observations (the years 1960-1972)
we use the confidence interval (-3, 3).  The t-statistic is
within this range, so we cannot say that X3 is significantly
different from zero.

c)  $H_0: R^2 = 0$
$H_1: R^2 > 0$

The test statistic $\frac{R^2/(p-1)}{(1-R^2)/(N-p)} \sim F_{p-1,\,N-p}$

Again, there are N observations and p variables.

$R^2$ and $(1-R^2)$ are ratios of sums of squares.  Each has the
same denominator.  Hence their ratio is a ratio of two $\chi^2$
random variables.  The ratio of two $\chi^2$'s, divided by their
degrees of freedom, is distributed F.


2.  Median family size ~ Gau(4.5, .20)

$Z = \frac{\text{median family size} - 4.5}{\sqrt{.20}} \sim \text{Gau}(0, 1)$


Pr(Median family size > 6)

$= \Pr\left(\frac{\text{Median family size} - 4.5}{\sqrt{.20}} > \frac{6 - 4.5}{\sqrt{.20}}\right)$

$= \Pr(Z > 3.33)$

< .001 (from the normal probability table)

Our supervisor, who claims that Pr(Median family size > 6)
= .05, is wrong.

Alternatively, if we note that 6 is more than 3 standard
deviations from the mean (median = mean in a well-behaved
batch), we know that our supervisor has overestimated the
frequency of median family sizes greater than 6.

3.  a)  With a large number of observations, a 95% confidence interval
is described by


.85 ± 1.96 (.05)

(.752, .948) is the 95% confidence interval.


b)  $H_0: P = .95$
$H_1: P \ne .95$


Since the confidence interval that we constructed in part
(a) is our acceptance region when α = .05, we reject $H_0$.
We disagree with the computer center.

c)  We have relied on the assumption that our data are approximately
normal.  However, the distribution is very skewed with a p of
.85 or .95.  (p is bounded by 1.)  In light of the skewness,
30 is not a large enough sample size to justify our assumption.
Unit 7
Reading Assignments


Lecture    Reading

7-0        Warwick and Lininger, chapters 1, 2, 3;
           Davis, "Are Surveys Any Good..."

7-1        Warwick and Lininger, chapters 6, 7, 8;
           Sudman, "Sample Surveys".

7-2        Warwick and Lininger, chapters 4, 5;

           Optional:

           Coleman, et al., "Relation of School Factors...";
           Duncan, "Measuring Social Change...";
           Featherman and Hauser, "Design for a Replicate Study...";
           Stokes, "Some Dynamic Elements...";
           Winsborough, "Age, Period and Cohort..."


References:

Warwick, D. P. and C. A. Lininger, The Sample Survey: Theory
and Practice, McGraw-Hill, 1975.

Davis, J. A., "Are Surveys Any Good, and if so, for What?" in
Perspectives on Attitudes Assessment:  Surveys and Their
Alternatives:  Proceedings of a Conference, Smithsonian
Institution Technical Report #2, August 1975, National
Technical Information Service # AD-A014321; pp. 41-47.

Sudman, S., "Sample Surveys", in Annual Review of Sociology,
Volume 2, edited by A. Inkeles, J. Coleman, and N. Smelser.
Palo Alto:  Annual Reviews Inc., 1976.  pp. 107-120.


Optional Readings:

The following three articles appear in Social Indicator Models,
edited by K. C. Land and S. Spilerman, Russell Sage Foundation,
New York, 1975.

Duncan, O. D., "Measuring Social Change via Replication of
Surveys", pp. 105-128.

Featherman, D. L. and R. M. Hauser, "Design of a Replicate Study
of Social Mobility in the United States", pp. 219-252.

Winsborough, H. H., "Age, Period, Cohort, and Education Effects
on Earnings by Race: An Experiment with a Sequence of
Cross-Sectional Surveys," pp. 201-218.


The following two articles appear in The Quantitative Analysis
of Social Problems, edited by Edward R. Tufte, Addison-Wesley
Publishing Company, Reading, Massachusetts, 1970.

Coleman, James S., Ernest Q. Campbell, Carol J. Hobson, James
McPartland, Alexander M. Mood, Frederic D. Weinfeld, and
Robert L. York, "Relation of School Factors to Achievement and
Integration and Achievement", from Equality of Educational
Opportunity.

Stokes, Donald E., "Some Dynamic Elements of Contests for the
Presidency".
Lecture 7-0.  Introduction to Unit 7
Introduction to Unit 7: Sample Surveys

Lecture Content:

1.  Definition of a Sample Survey

2.  Examples of Sample Surveys

Main Topics:

1.  What is a Survey

2.  Components of a Survey

3.  Motivation for conducting a Sample Survey

(There are no transparencies for this lecture.)
Reference:  Warwick and Lininger, Chapters 1-3
Topic 1.  What is a Survey?

I.  A data collection procedure (to be distinguished from a data
analysis procedure).

II.  A detailed investigation, mapping, or inspection to enumerate
or observe characteristics of a population.

III.  Examples

1.  Survey of wildlife in a region

2.  Survey of objects on a desk

3.  Survey of rocks in a soil sample

4.  Survey of opinions held by residents of a city

IV.  Major forms

1.  Census--complete survey:  every object in the relevant
population is involved.

2.  Sample survey--partial survey:  members of the population
are selected and the entire population's characteristics
are inferred from the sample.
Topic 2.  Components of a Survey

I.  Instrument

1.  Observation rule

2.  Interview topics

3.  Questionnaire

4.  Continuous record

II.  Fielding procedure

1.  Interview structure

a.  Open-ended

b.  Structured

c.  Item response

2.  Data collector

a.  Interviewer
i.  Face-to-face
ii.  Telephone

b.  Self-administered
i.  Questionnaire
ii.  Diary

c.  Unobtrusive observer
i.  Participant
ii.  Mechanical recorder

III.  Data recording and reduction

1.  Items on schedule

2.  Coders

3.  Direct to computer

4.  Machine readable forms
IV.  Analysis plan

V.  Sampling procedure (for sample surveys)

1.  Ad hoc

2.  Arbitrary

3.  Probability

4.  Oversampling

VI.  Overall design

1.  Cross-section

2.  Panel, successive samples

3.  Snowball

4.  Multiple questionnaires (for different categories of
respondents)

5.  Multiple linked items

6.  Timing

VII.  Staff requirements

1.  Administrative

2.  Clerical

3.  Field

4.  Scientific

a.  Questionnaire design

b.  Sampling procedure

VIII.  Sequence of activities

Issue defined -> population defined -> instrument designed ->
instrument tested -> sample designed -> sample selected ->
instrument fielded -> data returned -> data coded -> data
cleaned and reduced -> analysis commences

(Potential biases and errors occur at each stage.)


Topic 3.  Motivation for conducting a sample survey

I.  Nature of data--must interact with people

1.  Opinions, attitudes, experiences

2.  Past unrecorded actions

3.  Enumeration

4.  Behavioral intentions

5.  Legislative requirements (U.S. census)

II.  Why sample?

1.  Cost

2.  Efficiency--not all observations needed

3.  Necessity--not all population members available

III.  What purposes can a survey serve?

1.  Describe a population

2.  Test hypotheses and theories about behavior

3.  Deduce goals, interests or desires

4.  Evaluate programs

5.  Forecast outcomes

IV.  Problems with surveys

1.  Cost

2.  Interaction required--obtrusive

3.  Time consuming--for respondent and collection

4.  Error prone
Lecture 7-1.  Survey Design

Survey Design:  Designing instruments and fielding procedures for
administering sample surveys

Lecture Content:

1.  Concerns of the survey designer

2.  Examples of surveys

Main Topics:

1.  Respondents

2.  Questionnaire

3.  Interview

4.  Examples

(There are no transparencies for this lecture.)
Reference:  Warwick and Lininger, Chapters 6-8
Topic 1.  Respondents

I.  Who is to be surveyed--who is the survey about?

1.  Individuals

2.  Heads of households

3.  Income earners

4.  Parents

5.  Participants in a particular program

6.  etc.

II.  Where are the respondents located?

1.  Geographically

2.  Socioeconomically

3.  Behaviorally

III.  Where will they be interviewed?

1.  Residence

2.  At program site

3.  On the street

4.  In store

IV.  What impact does nature of respondents have on survey?

1.  Language

2.  Types of questions that can be asked

3.  Timing

4.  Access

5.  Security

6.  Response rate--cooperativeness
Topic 2.  Questionnaire

I.  What controls are required?

1.  Age

2.  Sex

3.  Race

4.  Ethnicity

5.  Family type and size

6.  Marital status

7.  Income

8.  Occupation

9.  Education

10.  Others...

II.  What indicators can be used?

1.  Duncan scale of occupational prestige

2.  U.S. Bureau of Labor Statistics or Census Bureau definitions

3.  Review other used measures (may be able to contrast results)

III.  Interview situation

1.  Problems

a.  Phone--selective, short, unknown respondent

b.  Self-administered--who really did it?

c.  Face-to-face--interviewer training

2.  Advantages

a.  Mail--cheap

b.  Phone--cheap, fast
IV.  Open versus closed response

1.  Closed--prompts meaning, limited category coverage

2.  Open--difficult to quantify, irrelevant responses, lack of
verbal ability, but gets spontaneous and unexpected
information

3.  Combination of open and closed

V.  Question writing--objectives

1.  Simplicity--for interviewer and interviewee;
includes structure, vocabulary, and responses

2.  Specificity--single issue focus

3.  Avoid [illegible]--biases and prompts

4.  Permit catch-all category

5.  Construct appropriate context

6.  Depersonalize answers

7.  Make relevant to respondent

8.  Voice in respondent's style

9.  Balance questions positively and negatively

10.  Avoid overly consistent response categories and sequence

11.  Avoid extreme statements

12.  Build in consistency checks

13.  Construct effective flow and branching

14.  Construct simple layout

15.  Keep size to minimum

16.  Provide handouts for complicated answers
Topic 3.  Interview

I.  Use trained interviewers (Reference:  Warwick and Lininger,
Chapter 8)

II.  Presentation of interviewer should be natural and unobtrusive

III.  Perform random checks on interviews (by callbacks, etc.)

IV.  Examples

1.  Choose two surveys, one of poor quality (such as a
magazine self-report questionnaire) and one of professional
quality (such as one administered by the National Opinion
Research Center or the instrument appearing on pages 172-181
of Warwick and Lininger).

2.  Make certain that students have copies of these survey
instruments

3.  Have students administer the instruments (or parts) to
one another

4.  Discuss positive and negative features of the questionnaires
Lecture 7-2.  Sample Design

Sample Design for Surveys--The use of statistical procedures for
selecting respondents and estimating errors.

Lecture Content:

1.  Types of sampling procedures

2.  Statistics for simple random sampling

Main Topics:

1.  Review motivation for sampling

2.  Sampling procedures

3.  Probability sampling

(There are no transparencies in this lecture.)
Reference:  Warwick and Lininger, Chapters 4-5
Topic 1.  Review motivation for sampling

I.  Why sample?

1.  Can all population members be interrogated?

2.  Need all population members be interrogated?

3.  Cost of complete census may be too high

4.  Adequate level of precision may be reached with sample

5.  Sample may be better than census--ask more questions of
fewer people

II.  What is role of sampling procedure?

1.  Select individuals to interrogate

2.  Provide mechanism for estimating error
Topic 2.  Sampling procedures

I.  Probability--individuals selected by chance mechanism with
certain known probabilities of inclusion

1.  Simple random sampling--equally likely and independent
inclusion probabilities

2.  Many variations (non-equal probability)

a.  Stratified

b.  Clustered

c.  Multistage

II.  Non-probability

1.  Haphazard

2.  Judgmental--interviewer determined

3.  Quota--categories outside

4.  Experts (paid by interviewer)

5.  Purposive

(Note:  Non-probability methods do not permit estimating
errors in inferring features of the population from characteristics
of the sample; thus it cannot be known what size sample is
required to obtain some specified level of precision.)
Topic 3.  Probability Sampling

I.  Structure

1.  List:  Inventory of population units

2.  Sampling units:  Actual sampling basis

3.  Frame:  Operational procedure to account for population

II.  Types of errors

1.  Instrument measurement error

2.  Interviewer bias

3.  Sampling error--this we can quantify in terms of a
confidence interval around a mean response:

a.  Once we sample, response is a random variable

b.  When variance and distribution of random variable are
known, confidence interval for the mean can be obtained

c.  Stating a confidence level, we can obtain an estimate of
the needed sample size

III.  Simple random samples--equally likely and independent

1.  All units chosen individually

2.  All units have same chance of being chosen

3.  Selection of one unit does not prejudice selection of any
other

4.  Various mechanisms from list

a.  Random number table

b.  Computer pseudo-random numbers

c.  Mechanical devices
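
Drawing a simple random sample from a list with computer pseudo-random numbers can be sketched as follows. This is an illustration only: the unit labels, list size, and seed are hypothetical, and sampling is without replacement so that every unit has the same chance of selection.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n distinct units from the list, each equally likely.

    A seed may be supplied so the draw can be reproduced and audited.
    """
    rng = random.Random(seed)
    return rng.sample(list(units), n)

# Hypothetical list (inventory) of 200 population units.
population_list = [f"unit-{i:03d}" for i in range(1, 201)]

sample = simple_random_sample(population_list, 10, seed=1975)
print(sample)
```

Recording the seed alongside the list makes the selection step repeatable, which helps when the sample has to be documented or checked later.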
IV.  Introduction to statistics for simple random sampling

Note:  For a more extensive treatment of sampling the instructor
should consult Kish, L., Survey Sampling, New York:
Wiley, 1965.

(Notation:  upper case letters refer to population,
lower case letters refer to sample)

When a survey is to sample attributes (numerical) in a
population we are interested in several issues (use binomial if
attribute is dichotomous or Normal approximation to the
binomial):

1.  We will examine average opinion in the sample, $\bar{x}$.

2.  From the sample average we will infer the population
average, $\bar{X}$.

3.  Given a level of precision for this inference, we will
specify the sample size, n.

4.  To select n responses randomly we will need an inventory
or list of the population, N.

Since we are sampling the population, the sample mean obtained
from one sample is only one of many possible means of similar
samples, i.e., it is a random variable with expected value
$E(\bar{x}) = \bar{X}$ and, in SRS, is distributed $N(\bar{X}, \sigma^2/n)$.
That is, it is Normally distributed and an unbiased estimate of
the population mean.

Note that $\bar{x}$ is Normally distributed even if the distribution
of the attribute being sampled is not Normal in the population
(except in cases where n < 30).

V.  Confidence intervals for the population mean

1.  Variance of population and sample.

$\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{N}$ = population variance

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ = sample variance

Using sample variance as estimate of population variance
we use $\frac{N-n}{N} s^2$ and note that as N becomes large this
correction becomes very small.
2.  Variance of sample means from similarly sized and drawn
samples (referred to as the square of the standard deviation
of the sample means or the standard error).

Theoretical value:  $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$

Estimate using sample variance:  $s_{\bar{x}}^2 = \frac{N-n}{N} \cdot \frac{s^2}{n}$

3.  Confidence intervals

95% confidence interval for $\bar{X}$ is
$\bar{x} \pm 2.0\, s_{\bar{x}}$

99% confidence interval for $\bar{X}$ is
$\bar{x} \pm 2.6\, s_{\bar{x}}$, etc.

4.  Example

Perform an experiment with the class by asking all to
record their age to the tenth of a year.  Sample the group
randomly and draw a sample of ten.  Compute statistics for
constructing a confidence interval for average class
member's age from actual average age.

VI.  Sample size

Examine the equation for the estimate of the standard error,
and discuss the relative impact on error reduction that occurs
by increasing proportionate sample size, (N-n)/N, and absolute
sample size, n.

Note the relative efficiency of increasing absolute size.

Since absolute size increases reduce $s_{\bar{x}}$ by $1/\sqrt{n}$, note
the decreasing percentage improvements as the ratio of sample
size to population increases.  The typical national sample
survey uses n between 1500 and 2500.
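
The trade-off between absolute and proportionate sample size can be illustrated numerically. This is a minimal sketch using the estimate $s_{\bar{x}}^2 = \frac{N-n}{N}\cdot\frac{s^2}{n}$ and the approximate multipliers 2.0 and 2.6 from the lecture; the population size, sample standard deviation, and sample mean below are made-up illustration values, not data from the course.

```python
import math

def srs_standard_error(s, n, N):
    """Estimated standard error of the mean under simple random
    sampling, including the finite population correction (N - n) / N."""
    fpc = (N - n) / N
    return math.sqrt(fpc * s ** 2 / n)

def srs_confidence_interval(xbar, s, n, N, multiplier=2.0):
    """xbar +/- multiplier * standard error (2.0 ~ 95%, 2.6 ~ 99%)."""
    half_width = multiplier * srs_standard_error(s, n, N)
    return xbar - half_width, xbar + half_width

# Hypothetical values: population of 1,000,000 and s = 1.0.
N, s = 1_000_000, 1.0
errors = {n: srs_standard_error(s, n, N) for n in (100, 400, 1600, 2500)}
for n, se in errors.items():
    print(n, round(se, 4))

# Hypothetical sample mean of 24.5 with n = 400.
interval_95 = srs_confidence_interval(24.5, s, 400, N, multiplier=2.0)
```

With a population this large the correction factor is almost 1, so quadrupling n roughly halves the standard error regardless of the sampling fraction.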
VII.  Modifications of SRS

In advanced classes discuss

1.  Stratification--divide population into strata

2.  Clustering--elements chosen as groups

3.  Systematic selection--use selection interval

4.  Unequal probability--weight selection probabilities

5.  Multistage sampling--selection involving two or more
successive stages

Reference:  Warwick and Lininger, pp. 95-110 and Chapter 5.
Homework
Unit 7


Write a short essay discussing the merits and disadvantages of
using sample surveys to collect data for policy analysis.

Discuss the merits and disadvantages of using archival survey data
for policy analysis.

Design an interview schedule to be administered in 36 minutes to
obtain information relevant to one of the following urban policy
issues:

1.  Public transportation

2.  Hospital care of the elderly

3.  Prenatal care of welfare mothers

4.  Satisfaction with garbage and sanitation services

5.  Air pollution

Be certain to include relevant control variables and discuss your
planned analysis in terms of policy amenable independent variables.

Using the questionnaire designed in the prior reading assignment,
design a sampling procedure in which a 1% SRS sample would be drawn
from an urban population.  Discuss the stages through which the
sample is actually drawn.  Choose five questions and estimate their
standard errors as:  .01, .05, .1, .2, .7.  Compute confidence
intervals for population attributes sampled by each question using
sampling fractions of 90%, 75%, 50%, 10%, 1%, .01%.  Assume the
city has a population of 500,000.  Assuming a cost of $50 per
completed interview, discuss the comparative merits of each
sampling design.
Quiz
Unit 7

Name: ________

Please read this quiz thoroughly before writing answers to any of
the questions. Make your answers brief and to the point. Excessive
wordiness and rambling responses will detract from your total score.
You have thirty (30) minutes to answer this quiz.

Examine the following questionnaire that appeared in the national
magazine Ms. and was meant to be filled in and returned by Ms. readers.
(Do not answer these questions. Quiz questions begin on the third page.)

First National Television Test on Sexual Attitudes (from Ms. Magazine)
Fact Questions

1.  Men are more aggressive than women.

True □     False □

2.  Single men are psychologically better off than married men.

True □     False □

3.  Most women are supported by men and therefore work for luxuries
not necessities.

True □     False □

4.  The average full-time male worker earns:

□ 70% More     □ 38% More     □ 18% More

than his female counterpart.

5.  Of all girls born in 1977, what percentage will work outside the
home during their lifetimes?

□ 33%     □ 55%     □ 90%

Opinion Questions

1.    If you could send only one child to college, would you send:

your son □     your daughter □     your oldest child □
2.    The more assertive and independent a woman is, the less
attractive she is to men:

agree □     disagree □

3.    A woman who decides not to have children is:

□ Missing one of life's greatest satisfactions
□ Unfeminine
□ Fulfilled in other equally valuable ways
□ Probably physically unable to have them

4.    What gives you the most satisfaction?

□ Family       □ Running a
□ Love Life    □ Career

5.    Who gets the better deal in this society?

Men □     Women □

END OF QUESTIONS THAT APPEARED IN MS. MAGAZINE


QMPM Quiz Questions:

1.    Comment on the split of these 10 questions by Ms. magazine into
fact and opinion. Are any of the questions "double-barreled" or
applicable to only a fraction of the respondents? Are the questions
"loaded"? Are the answers allowed for each question sufficient and
accurate? What other comments can you make about the nature of the
questions?


2.    Given a large response rate, what conclusions can be drawn by
an investigator from individual responses to these 10 questions?
What can one say about national "sexual attitudes" from this
survey? What qualifications must be made when generalizing
from this survey and why?


3.    Assume you have $150,000 to spend on a national survey of sexual
attitudes and that the questionnaire has been designed and field
tested. Describe a workable, reliable and efficient sampling
and implementation strategy. Be sure to discuss whether clustered
or stratified sampling should be employed and the nature of the
field work.
Quiz Solutions
Unit 7


There were many aspects of the survey which could be criticized,
including the following:

(a)  The distinction between fact and opinion is arbitrary.

(b)  There was no response alternative for "don't know" or "no
opinion."

(c)  Many of the words are loaded, such as "aggressive" and
"assertive".

(d)  Many of the words are subject to interpretation, such as
"psychologically better off."

(It  asks  two  different  questions.) 
tnfluence  the  responses. 

^I  cbllitJilbii  than  the  alternatives  listed. 
it  tSe  SatlSaal  ^opalatlon.  '""J  ili  and  sex  and  iocit±5n 


attitudes  on  the  basis  of  this  survey. 
There are many correct answers to this question; errors are apt
to be of omission, not commission. In designing your survey
strategy, did you consider:

(a)  who the population is that you wish to generalize about?
(the whole country? adults only? sexually active adults only?)

(b)  how large a sample to have, in either absolute numbers or
percentages?

(c)  what sampling strategy to use (cluster? stratified? simple
random sample?) and the relative advantages of the strategy
you chose? Remember the large scale of a national survey.

(d)  how to administer the survey (in person? by phone? by mail?)
and the relative advantages of your choice?

(e)  the potential embarrassment to respondents, particularly in
a face-to-face interview given by someone of the opposite
sex?

(f)  the cost of your strategy?
Module IV


Introduction to Module IV

Overview

Module IV of the Quantitative Methods for Public Management package
contains two units, numbers 8 and 9. Unit 8, Two-way classifications
for continuous data, introduces the student to the construction
of models for summarizing continuous data arrayed in a two-way table,
a table-type data structure quite common in public policy studies.
Three variables are involved, two factors and a response. The general
strategy is to fit a simple additive model to the table, compute
fitted values and residuals, and examine the quality of the model. The
fitting procedure employed involves iterated decomposition of the table
using repeated removal of medians (i.e., median polish) or means
(i.e., mean polish). A procedure is introduced for determining
whether the data need to be transformed to improve the appropriateness
of an additive model. Techniques are also discussed for handling
ordinal levels in the factors and for constructing a model with an
interaction term.

Unit 9, Discrete Multivariate Analysis, introduces the student to
the analysis of contingency tables, another table-type data structure
common in policy studies. The data in this case are discrete
frequencies, counts of the simultaneous occurrence of two or more
conditions. The question posed by analysis is whether or not the table
provides evidence of independence in the variables. The strategy is to
introduce students to contingency tables via traditional tests for
goodness of fit in one-dimensional tables and then to develop
log-linear models in the analysis of higher-dimensional tables.
Specific Objectives
Unit 8

Upon successful completion of Unit 8 a student will be able to
recognize continuous data that can be arrayed in a two-way layout and
will be able to analyze the data. Analysis could include the
construction of elementary additive models using median or mean polish,
computation of comparison values and construction of diagnostic plots,
identification of data requiring transformations, selection of
appropriate transformations, construction of displays of coded
residuals, evaluation of the fit of the models, plots of effects for
ordinal factors, development of extended summaries for ordinal factors,
and development of extended models incorporating an interaction term.

Unit 9

Upon successful completion of Unit 9 a student will be able to
identify data which can be analyzed as a one, two, or more dimensional
contingency table. Analysis will include determination of appropriate
probability models, construction of cross-product ratios, computation
of Pearson's $\chi^2$ test for goodness of fit in the case of a
one-dimensional table, construction of log-linear models in
higher-dimensional tables, and tests for independence of variables and
for interactions. Students will have obtained experience in constructing
log-linear models for frequency data arising in commonly reported
tables such as opinion surveys and censuses.
Prerequisite Inventory
Units 8 and 9


In this module we analyze data which come in two forms. The data
in Unit 8 are two-way tables, which relate one Y and two X variables.
In Unit 9 we look at contingency tables, which list the number of
observations in the different categories of one or more variables.

Comprehension of Module I is assumed. Stem-and-leaf displays and
medians are topics covered in Module I that are also used in this
module. The topics in this inventory are:

1.  Review of Numbers: Amounts and Counts

2.  Review of Resistant Lines

3.  Review of Hypothesis Testing

4.  Data Structures

If you are uncertain about any of these topics after reading this
inventory, please consult a member of the teaching staff. Mastery of
this material is essential before proceeding to Module IV.

Section 1.    Review of Types of Numbers

In Unit 1 four types of numbers were discussed: amounts, counts,
bounded numbers, and differences. In this module it will be necessary
to distinguish counts from amounts. Two-way tables contain amounts.
Contingency tables contain counts.

Amounts are levels of a variable. Amounts may be either discrete
or continuous, but for our purposes we usually think of them as
continuous. When we discuss thousands of dollars of income, income can
take on so many values that the variable is essentially continuous
even though the smallest unit it can be expressed in is .01 dollars.
As another example, distance is a continuous variable and 56.34 miles
is an amount.

A count is the number of observations in a category. Counts take
on only non-negative integer values. The number of people in the U.S.
with income greater than $20,000 is an example of a count.

Compare  these  2x2  tables: 

Average income (in $) of             Number of Transylvania
Transylvania residents, by           residents, by race
race and sex                         and sex

          Male      Female                     Male       Female
Black     8,466     8,000            Black     87,508     88,981
White     9,666     8,260            White     195,067    198,216

The table on the left introduces a new variable (average income)
but tells nothing about the number of people whose incomes contributed
to the averages in each of the four cells. The table on the right
tells the number of observations in each of the four categories but
introduces no new variable. The table on the left is called a "Two-
Way" table of amounts; the one on the right, a "Contingency Table" of
counts.
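
The distinction can also be seen by building both kinds of table from the same raw records. The sketch below uses six invented (race, sex, income) records, not the Transylvania data; the averages form a two-way table of amounts and the tallies form a contingency table of counts.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual records: (race, sex, income in $)
records = [
    ("Black", "Male", 8200), ("Black", "Male", 8700),
    ("Black", "Female", 8000),
    ("White", "Male", 9500),
    ("White", "Female", 8100), ("White", "Female", 8400),
]

cells = defaultdict(list)
for race, sex, income in records:
    cells[(race, sex)].append(income)

# Two-way table of amounts: a new variable (average income) in each cell
amounts = {cell: mean(incomes) for cell, incomes in cells.items()}
# Contingency table of counts: number of observations in each cell
counts = {cell: len(incomes) for cell, incomes in cells.items()}

for cell in sorted(cells):
    print(cell, "average income:", amounts[cell], "count:", counts[cell])
```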

Section  2.    Review  of  Resistant  Lines 

A  clear  understanding  of  resistant  lines  is  important  for  two 
reasons:    many  of  the  concepts  used  in  describing  two-way  tables  are 
analogous  to  techniques  used  in  fitting  resistant  lines,  and  there  are 
relationships  in  two-way  table  analysis  that  are  best  described  by 
fitting  resistant  lines. 

A resistant line is a fit which describes the relationship of
paired (X,Y) data. If X and Y are linearly related (in raw or
transformed units), a resistant line summarizes that relationship with a
single equation. Unlike least squares regression lines, resistant
lines are not much affected by a couple of points which deviate from
the linear trend.

To fit a resistant line, break the ordered X's into thirds,
carrying along with each X its paired Y value. Calculate a conditional
typical value of X and of Y for each of the three minibatches
of paired values. The conditional typicals will be the (median X,
median Y) of each third, although these pairs may not have been paired
among the original N ordered pairs.

Before fitting a line, check to see if the data need to be
transformed. To do this, proceed to list the conditional typicals
for the low, middle, and high thirds:

$(X_L, Y_L), \; (X_M, Y_M), \; (X_H, Y_H)$

Calculate the two slopes, $m_1$ and $m_2$, of the line segments
joining adjacent conditional typicals.

If the data are linear, then $m_1/m_2$ will equal 1. If the ratio is
<1, transform X down the ladder of powers; if >1, transform up the
ladder of powers. You need transform only the three summary points to
see if the transformation is successful. After you decide on the
appropriate transformation, then transform all of the data.

Once the data are linear, the next step is to remove the tilt
(or slope) from the line. The slope is determined from the
conditional typicals of the low (L) and high (H) thirds by

$m = \frac{Y_H - Y_L}{X_H - X_L}$

where X may now represent transformed data.

Remove the slope by rewriting each $Y_i$ as $Y_i - mX_i$. The new
conditional typicals are $(X_L, Y_L - mX_L)$, $(X_M, Y_M - mX_M)$,
$(X_H, Y_H - mX_H)$, where M denotes the middle third.

The level (or intercept) of the line is the median of $Y_L - mX_L$,
$Y_M - mX_M$, $Y_H - mX_H$. Subtract the level from each Y value. Now
we're left with residual = Y - mX - level.

We may choose to polish the line by treating the residuals
(i.e., Y - mX - level) as a new batch of Y's, repeating the fitting
procedure described above, and adding the polished fit to the
original fit. We calculate a new batch of residuals from the polished
fit and may polish again if we'd like to. The decision to polish is
usually based on the appearance of the residuals (in a stem-and-leaf
or as plotted against X).
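
The fitting steps just reviewed can be sketched in code. This is an illustration under assumptions, not the course's own program: the function names are hypothetical, the rule for splitting batches whose size is not a multiple of three is one common convention, the data are invented (Y = 2X + 1 with one wild point), and no polish step is included.

```python
from statistics import median

def split_thirds(pairs):
    """Order the pairs by X and break them into low, middle, high thirds."""
    pairs = sorted(pairs)
    n = len(pairs)
    k, extra = divmod(n, 3)
    outer = k + (1 if extra == 2 else 0)  # one leftover point goes to the middle
    return pairs[:outer], pairs[outer:n - outer], pairs[n - outer:]

def summary_point(third):
    """Conditional typical: (median X, median Y) of one third."""
    return median(x for x, _ in third), median(y for _, y in third)

def resistant_line(pairs):
    low, mid, high = split_thirds(pairs)
    (xl, yl), (xm, ym), (xh, yh) = (summary_point(t) for t in (low, mid, high))
    slope = (yh - yl) / (xh - xl)
    # Level: median of the slope-removed conditional typicals
    level = median([yl - slope * xl, ym - slope * xm, yh - slope * xh])
    residuals = [y - slope * x - level for x, y in sorted(pairs)]
    return slope, level, residuals

# Hypothetical data: Y = 2X + 1, except for a wild value at X = 5
data = [(x, 2 * x + 1) for x in range(1, 10)]
data[4] = (5, 40)
slope, level, residuals = resistant_line(data)
print(slope, level, residuals)
```

The wild point shows up as a single large residual while the slope and level are unaffected, which is the sense in which the line is resistant.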

Section  3.    Review  of  Hypothesis  Testing 

Many aspects of probability and inference are utilized in Module
IV. In particular, you should feel comfortable with hypothesis
testing, levels of confidence ($\alpha$), and $\chi^2$ distributions.

In hypothesis testing, we establish a null hypothesis, called $H_0$,
which we express in quantitative terms. The null hypothesis is
generally a supposition about a population parameter. We then do
whatever analysis is appropriate to the hypothesis, based on a sample
from the population in question and on the assumption that $H_0$ is
true. If our analysis leads to conclusions that are "unlikely", we
reject $H_0$, i.e., conclude that it cannot be true. Otherwise, we do
not reject $H_0$, i.e., conclude that based on our analysis $H_0$
could be true.

The decision as to whether or not a result is "likely" is not a
subjective decision but is based on probability. We cannot be correct
all the time, but we can decide how much "being wrong" we are willing
to tolerate. The proportion of times we expect to be wrong (rejecting
a true $H_0$) is $\alpha$, and $1 - \alpha$ (times 100%) is our level
of confidence. Commonly used levels of confidence are 90%, 95%, and 99%.

We are able to quantify our confidence in this precise manner
because of our knowledge about underlying probability distributions.
For example, in least squares regression we tested the hypothesis that
a true $\beta$-coefficient was zero. We made use of our knowledge of t
distributions to determine whether the sample coefficient was likely
to be non-zero when the true $\beta = 0$.

In Unit 9 we will use our knowledge of the $\chi^2$ distribution in
hypothesis tests. A $\chi^2$ random variable is defined as the sum of
squared independent standard normal random variables. It is
characterized by one parameter, its degrees of freedom. In theory,
degrees of freedom are determined by the number of normal random
variables which are squared to form the $\chi^2$; in practice, we will
figure out the degrees of freedom from the number of variables and
number of observations in our data. Just as we used t-tables and
Z-tables, there are $\chi^2$ tables which tell the probability with
which a $\chi^2$ random variable takes on values within specified
regions.
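
The definition can be checked by simulation. The sketch below is illustrative only: it builds draws with 3 degrees of freedom by summing three squared standard normal values, then compares the share of draws above 7.815 (the tabled upper 5% point for 3 degrees of freedom) with the nominal 5%. The seed, the number of draws, and the name `chi_square_draw` are arbitrary choices.

```python
import random

def chi_square_draw(df, rng):
    """One chi-square draw: the sum of df squared standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

rng = random.Random(0)
df = 3
draws = [chi_square_draw(df, rng) for _ in range(20000)]

mean_draw = sum(draws) / len(draws)                       # should be near df
share_above = sum(d > 7.815 for d in draws) / len(draws)  # should be near 0.05
print(f"mean of draws: {mean_draw:.3f}")
print(f"share above 7.815: {share_above:.4f}")
```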
Section 4.  Data Structures

Most of the data that we have looked at so far have been either
one-dimensional or two-dimensional. One-dimensional data are typically
single batches of data, written as a list of numbers, a vector, or an
n x 1 or 1 x n table. Examples:

Census      Number of             Quiz
Tract       Physicians            Grades

  1            80                  99
  2            43                  97
  3             5                  86
                                   72

Paired (X,Y) data for multiple regression analysis may be thought
of as having two dimensions, one dimension for each variable. Along one
dimension are the variables (e.g., income, age) and along the other is
whatever characterizes the observations (e.g., census tract, city). For
example, the Hospital insurance data:

                  old          new
                  premium      premium

Children's        866          646
Beth Israel       833          635
McLean            255          218
Mt. Auburn        162          148
Deaconess         435          348

illustrate a set of 5 paired observations arranged in a 5x2 table.
Now consider the following tables.


                 Male                            Female
             Black   White                   Black   White

Age  < 40                        Age  < 40
     ≥ 40                             ≥ 40

Number of college deans
by age, race, and sex


Each of the eight cell entries represents an observation across the
three dimensions of age, race, and sex. If we could present such tables
in three dimensions, we would have done so, placing one of the tables
on top of the other. Since we have to present the data on
two-dimensional paper, we placed the tables next to each other. The
decision to split into separate tables on the basis of sex was
arbitrary; we could as easily have written


               Age < 40                        Age ≥ 40
            Male   Female                   Male   Female

Black                            Black
White                            White


or any one of four other combinations.

With an understanding of three-dimensional data, we can easily
extend our knowledge to larger dimensions. Suppose we want to add
region to the college dean data. Below is one way to represent the
four-dimensional data.

                                 Male
                 Northeast    South    Midwest    West

Black   < 40
        ≥ 40
White   < 40
        ≥ 40

                                Female
                 Northeast    South    Midwest    West

Black   < 40
        ≥ 40
White   < 40
        ≥ 40
Homework

Prerequisite Inventory, Units 8 and 9

1.  Identify the dimensions of the following tables as 1, 2, 3, or
more dimensions and state whether the cells of the table contain
counts (number of observations) or amounts (variable).

a.    median age of college students, by class and college

b.    enrollment in each of the elementary schools in the city of
Pittsburgh

c.    number of blue-collar and white-collar workers in major U.S.
cities

d.    number of demolitions in 1976 by building type and census
tract

e.    number of patients in Philadelphia hospitals, by hospital,
illness, and age

Answer questions 2-10 as briefly as possible.

2.  When you examine a batch of residuals, what are you looking for?

3.  How might you want to examine residuals from a fitted line?

4.  What values can a count take on?

5.  What type of data do you fit resistant lines to?

6.  Identify (median X, median Y) in the following (X,Y) batch:

(3, 13)
(5, 11)
(6, 18)
(6, 10)

7.  How many steps of polish are necessary when fitting a resistant
line?

8.  When do we conclude that a null hypothesis is true?

9.  In hypothesis testing, why are we willing to be wrong some of the
times that we reject the null hypothesis?
Homework

Prerequisite Inventory, Units 8 and 9
Solutions


1.  a)  two-dimensional table of amounts

b)  one-dimensional table of counts

c)  two-dimensional table of counts

d)  two-dimensional table of counts

e)  three-dimensional table of counts

2.  Gaussian shape, centered and clustered at zero, very few outliers

3.  Plot the residuals against X or Y or Ŷ (the fitted Y)

4.  0, 1, 2, 3, ...

5.  Paired (X,Y) data which exhibit a linear relationship in either raw
or transformed units

6.  (5.5, 12)

7.  It depends on the shape and size of the residuals after the original
fit (and each step of polish), and on whether you are fitting the
data by hand or computer.

8.  We never conclude that a null hypothesis is true; we conclude that
it could be true if for a specified level of confidence the truth
of the null hypothesis could lead to the observed sample statistic(s).

9.  If we weren't willing to be wrong some of the time, we would never
reject the null hypothesis. We can never know a true population
parameter, but we can decide what percentage of the time we are
willing to be wrong.
Unit 8
Reading Assignments

Lecture     Reading

8-0         Tukey, Chapter 10, pages 331-348

8-1         Tukey, Chapter 10, pages 348-363

8-2         Tukey, Chapter 11

8-3         Singer, "Exploratory Strategies and
            Graphical Displays," Journal of
            7, pages 57-70

In addition, please read the following article:
Fairley & Mosteller, pp. 23-50

Texts:

Fairley, W.B. and F. Mosteller, Statistics and Public Policy. Reading,
Massachusetts, Addison-Wesley, 1977.

Tukey, J.W., Exploratory Data Analysis. Reading, Massachusetts,
Addison-Wesley, 1977.
Lecture 8-0.     Introduction to Unit 8
Introduction to Unit 8--Two-Way Tables
Lecture Content:

1.  Definition of Two-Way Tables

2.  Examples of Two-Way Tables
Main Topics:

1.  What is a Two-Way Table

2.  Examples of this common data form

3.  What does the analysis mean

(There are no transparencies for this lecture.)
Reference:    Tukey, Chapter 10
Topic 1.    Introduction to Unit 8--Two-Way Tables


I.    What is a two-way table?

1.    A rectangular array of responses laid out in rows and
columns

                var 2

            1     2     3

2.    Data come as triples

3.  Variables (factors) 1 and/or 2 may be ordinal

4.  Response is numeric

II.    Examples--common data form

1.  Pittsburgh food data

2.  Infant mortality by region and year

3.  Others?    Unemployment by year, reg.

III.    What does analysis mean?

1.  Question:    what effect does each factor have on the
response?

Data = row effect + column effect + common

Decomposition into effects
Use residuals for evaluation

2.  Question:    possible role for transformations?
Analytic procedure--Median Polish
Lecture 8-1.    Analyzing Two-Way Tables of Responses

Analyzing two-way tables using median polish (Simple Fits): the use of
median polish to construct simple summaries of two-way tables. (1)

Lecture Content:

1.  Discuss simple model for two-way table

2.  Discuss median polish
Main Topics:

1.  Two-way tables

2.  Simple additive summary

3.  Median polish
Topic 1.    Structure of a two-way table

                              Factor 1
                 Level 1    Level 2    ...    Level n

           Level 1
Factor 2   Level 2            Response
           ...
           Level m


I.    Simple additive "model"

1.  Data = Fit (Response) + Residual                        (2) (3)

2.  Response = Contribution (F1) + Contribution (F2) + Common

3.  Common = Typical for entire table

Row fit = conditional typical on row

Column fit = conditional typical on column

Row effect = row fit - common

Column effect = column fit - common

Thus,

Fit (Response) = row effect + column effect + common

or

               = row fit + column fit - common
II.    Elementary Analysis (Because we have only an additive model) (4)

                 F1 levels       row eff.      row fit

F2 levels        residuals

col. eff.                        com           (-com)

col. fit

1.  Technique (using Means or Medians)

Median Polish: Decomposition of a two-way table into row
and column effects by repeated (iterated)
removal of medians.

(5)

2.  Procedure

3.  Details of Method

a.  Get row medians and grand median

b.  Subtract from $X_{ij}$ and then get column medians

c.  Check row medians:    all zero?

d.  If no--subtract row medians from $X_{ij}$ and get column medians

e.  Check column medians:    all zero?

f.  If no--repeat

If yes--add parts (eff.) and common to get fits
4.  Fit:  Row Eff + Col Eff + common

or Row Fit + Col Eff

or Row Eff + Col Fit

Then, Residual = Data - Fit

5.  Construct--stem & leaf--check for symmetry

Construct--elementary analysis table--check for non-
additivity by examining for opposite corner sign
pattern

Examine effects and fits (6) (7) (8)

Example 1.    High School Grades and GPA
Example 2.    Infant Mortality by region
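
For instructors who want to check the transparencies by machine, the median polish procedure of Section II can be sketched as follows. This is a minimal illustration, not the course's official routine: the function name `median_polish`, the cap of ten sweeps, the order of the sweeps, and the small example table are all assumptions.

```python
from statistics import median

def median_polish(table, max_iter=10):
    """Decompose a two-way table into common, row effects, column effects,
    and residuals by repeated removal of row and column medians."""
    rows, cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    common = 0.0
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects
        row_med = [median(row) for row in resid]
        for i in range(rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        # Move the median of the column effects into the common term
        shift = median(col_eff)
        common += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep column medians out of the residuals into the column effects
        col_med = [median(resid[i][j] for i in range(rows)) for j in range(cols)]
        for j in range(cols):
            col_eff[j] += col_med[j]
            for i in range(rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row effects into the common term
        shift = median(row_eff)
        common += shift
        row_eff = [r - shift for r in row_eff]
        if all(m == 0 for m in row_med) and all(m == 0 for m in col_med):
            break  # all row and column medians are zero: stop polishing
    return common, row_eff, col_eff, resid

# Hypothetical, perfectly additive table: 5 + row amount + column amount
table = [[5 + a + b for b in (0, 1, 2)] for a in (0, 10, 20)]
common, row_eff, col_eff, resid = median_polish(table)
print(common, row_eff, col_eff)
```

At every stage, data = common + row effect + column effect + residual, so the fit for a cell is the sum of the first three parts.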
III.  Problems


1.    Code residuals

Symbol  Residuals 


X 

X 

o 

b 


uh  +  step 

uh 

til 

Ih •+  step 


2.  Repeated values:    take cell medians

3.  Holes:     Skip; after more polish, get fitted values
Lecture 8-1

Transparency Presentation Guide

Lecture Outline       Transparency
Location              Number         Transparency Description

Topic 1               1              Lecture 8-1 Outline

Section I, 1.         2              Effects and Common in
                                     Multiple Batches

Section I, 1.         3              Two-way table of
                                     Responses

Section II            4              Two-way table,
                                     Elementary Analysis

Section II, 1.        5              Median Polish: Procedure

Section II, 5.        6              Predicting Freshman
                                     College Grades

Section II, 5.        7              Median Polish: College
                                     Grades 1

Section II, 5.        8              Median Polish: College
                                     Grades 2
Lecture 8-2.    Evaluating Additivity

Diagnostic plots:    Evaluating the adequacy of an additive model as a
summary for a two-way table. (1)

Lecture Content:

1.  Detecting nonadditivity

2.  Computing comparison values

3.  Diagnostic plots

4.  Transformations

Main Topics:

1.  Review additive model

2.  Discussion of comparison values and diagnostic plots
Topic 1.    Review additive model

I.    Summary has:

Data = Row effect + Column effect + common + residual

II.    Departures from additivity

1.    Once elementary analysis is completed, arrange residuals
in "effect order"                                          (2)

                            COLUMN EFFECTS
                         smallest      largest

              smallest
ROW EFFECTS                  residuals
              largest


2.    Examine effect-ordered residuals for evidence of opposite
corners sign patterns:

   +  -              -  +
             or
   -  +              +  -
Topic 2.    Comparison Values and Diagnostic Plots

I.    Comparison Values and Diagnostic Plots

1.  $\text{Comparison value} = \frac{\text{row effect} \times \text{column effect}}{\text{common}}$
for each cell  (3)

2.  Plot residual against comparison value

[Diagnostic Plot: Residual (vertical axis) against Comparison value
(horizontal axis)]

3.    Note that:

a.  If residuals are all around zero this plot will be flat

b.  If residuals equal comparison values or equal comparison
values times some constant, there is some non-additive
(multiplicative) component in residuals

4.    Explore for non-additivity by putting resistant line
through plot

5.    Flatten diagnostic plot by re-expressing data as
$x_{ij}^{1-m}$, where m is the slope of the line in the
diagnostic plot, using ladder of powers

6.  Redo entire procedure to determine if re-expression was
effective

7.  Note:

a.  Weak patterns in the diagnostic plot will not re-
express well

b.  Non-monotone patterns require more complicated fits

c.  1-m must be interpreted loosely--it is a guide, an
approximation

8.  Example:    HS grades and Freshman GPA (4-8)
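
The computation behind the diagnostic plot can be sketched as follows. The effects below are invented, and the residuals are constructed to be purely multiplicative so that the answer is known in advance. Note one substitution: a least-squares slope stands in for the resistant line of step 4 to keep the sketch short, and the helper names are hypothetical.

```python
def comparison_values(common, row_eff, col_eff):
    """Comparison value for each cell: row effect x column effect / common."""
    return [[r * c / common for c in col_eff] for r in row_eff]

def least_squares_slope(xs, ys):
    """Ordinary least-squares slope (used here in place of a resistant line)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical results of an elementary analysis
common = 10.0
row_eff = [-5.0, 0.0, 5.0]
col_eff = [-2.0, 0.0, 2.0]
cv = comparison_values(common, row_eff, col_eff)

# Residuals equal to the comparison values: a purely multiplicative table
residuals = [list(row) for row in cv]

flat_cv = [v for row in cv for v in row]
flat_res = [v for row in residuals for v in row]
m = least_squares_slope(flat_cv, flat_res)
power = 1 - m  # suggested power on the ladder; 0 is read as the log
print(f"slope m = {m:.2f}, suggested power 1 - m = {power:.2f}")
```

A slope near 1 suggests a power near 0, that is, the log re-expression, as in the grade data transparencies.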
Lecture Outline       Transparency
Location              Number         Transparency Description

Beginning             1              Lecture 8-2 Outline

Section II, 1.        2              Residuals in "Effect" Order

Topic 2,              3              Comparison Values and
Section I                            Diagnostic Plot

8.                    4              High School Grades and
                                     Freshman Grade Point
                                     Average

                      5              High School Grade Data:
                                     Residuals and Comparison
                                     Values

                      6              Diagnostic Plot of Grade
                                     Data

                      7              Diagnostic Plot of Log
                                     (Grade Data)

                      8              Elementary Analysis of Log
                                     (Grade Data)
Lecture 8-3.    Extending the model

Extending the Model:    Summarizing effects in ordinal data using fitted
lines and developing extended fits for interactions.

Lecture Content:

1.    Discuss summaries for effects

2.    Discuss interactions
Main Topics:

1.  Plotting effects to construct simple summaries

2.  Extended fits
Topic 1.  Plotting effects to construct simple summaries

I.  Simple additive model for categorical data (Review)

    [Figure: two-way table layout showing the row factor, the column
    factor, the response, the row effects, the column effects, and the
    common value]

    1.  Data = Fit + Residual
        becomes:

        Response = column effect + row effect + common + residual

    2.  Transformation of the response variable may be required
        to improve additivity.

        Then the model is:

        (Response)^r = column effect + row effect + common

        and the data structure is conceived of as:

        [Figure: the same two-way table layout, with (Response)^r
        in place of the response]

    3.  When the factors are categories of categorical variables,
        a summary formula for the effects is not possible,
        and each effect, one for each category of each factor,
        must be represented in the model.
II.  Additive model with ordinal data

    1.  When factors have quantitative levels, i.e., ordinal data,
        then it is possible to consider fitting a model with
        summaries for effects.

    2.  Generally, we can try to fit

        (Response)^r = f(Factor 1) + f(Factor 2) + common

        where the right hand functions are linear or linear through a
        transformation.

        (Note that this is the most general representation.  It is not
        necessary for the response to be transformed or for both
        factors to be ordinal.)

    3.  An alternative representation:

        (Response)^r = (a1 + b1 F1^r1) + (a2 + b2 F2^r2) + common

        (Note that this assumes that the effects for each factor can
        be summarized as a linear function of the factor's levels.)

III.  Finding summaries for effects of ordinal variables.

    1.  Plot effect against level, using the x-axis for level.  (One
        plot for each factor.)

        [Figure: scatterplot of effect (vertical axis) against factor
        level (horizontal axis)]

    2.  If it appears to be reasonable, fit a line to the scatterplot
        (transform factor levels if this is required).

    3.  Use the equation obtained as the summary of effects for the
        factor.


IV.  Example:  College grade point average as a response to sex of
     student and high school average grade

    1.  Simple additive and logged fits (effects only are shown)

    2.  Plot of row effect against row factor (high school grade).
        Line:  row effect = -.56 + .23 (HSG)  (3)
        Model:

        GPA = (-.56 + .23 (HSG)) + (Sex Effects) + common

    3.  Plot of row effect against row factor (logged response).
        Line:  row effect = -.12 + .04 (HSG)
        Model:

        log GPA = (-.12 + .04 (HSG)) + Sex Effect + common

V.  Example:  Moody bonds, net interest as a response to year
    and grade.

    1.  Original data and effect analysis  (4)

    2.  Plot of row effects against row factor level (year).
        Line:  row effect = -.50 + .30 (year - 1964)  (5)

    3.  Plot of column effects against column factor level
        (bond grade).
        Line:  column effect = -.33 + .26 (grade)  (5)

    4.  Model

        Net interest = (-.50 + .30 (year - 1964)) + (-.33 + .26 (grade))
                       + common
        or

        Net interest = .30 (year - 1964) + .26 (grade) + 3.70
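
As a check on how the two fitted lines combine into one formula, the sketch below writes the Moody bond model both ways and compares them. It rests on assumptions not stated in the outline: that the constant 3.70 absorbs the two intercepts and the common value (which would imply a common value of 4.53), that the grade coefficient is .26 as in the column-effect line, and that bond grade is coded numerically. All function names are hypothetical.

```python
def row_effect(year):
    """Fitted line for the year (row) effects."""
    return -0.50 + 0.30 * (year - 1964)

def column_effect(grade):
    """Fitted line for the bond-grade (column) effects."""
    return -0.33 + 0.26 * grade

# Assumption: 3.70 = common - .50 - .33, so the implied common is 4.53.
COMMON = 3.70 + 0.50 + 0.33

def net_interest(year, grade):
    """Model written as row summary + column summary + common."""
    return row_effect(year) + column_effect(grade) + COMMON

def net_interest_collapsed(year, grade):
    """The same model with the constants collected into one term."""
    return 0.30 * (year - 1964) + 0.26 * grade + 3.70

# Hypothetical inputs: year 1968, grade code 3.
print(net_interest(1968, 3), net_interest_collapsed(1968, 3))
```

The two forms agree for any year and grade, which is all the "or" in the outline asserts.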
Topic 2.  Extended Fits

I.  Purpose

    1.  To include an interaction effect

    2.  To improve additivity where transformations do not
        make sense

II.  Model

     Response = Row Effect + Column Effect + common

                + k (Row effect)(Column effect) / common

     where k is the slope of a line through the diagnostic plot

III.  Procedure

    1.  Perform elementary analysis

    2.  Construct a diagnostic plot

    3.  Fit a line and find k

    4.  Compute the difference between the residuals from the
        elementary analysis and k (row effect)(column effect) / common;
        these differences are the new residuals

    5.  Contrast improvement by computing sum of absolute
        residuals

    6.  Compute fitted values from basic model

IV.  Construct example using college grade point average data.
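
The procedure in section III can be sketched as follows. This is a minimal illustration, not the course's own computation: the table is synthetic (built to contain an exact multiplicative term), the function name is hypothetical, and a least-squares slope from numpy.polyfit stands in for the resistant line through the diagnostic plot.

```python
import numpy as np

def extended_fit(data, common, row_eff, col_eff):
    """Add the term k * (row effect)(column effect) / common to an additive fit.

    Returns k, the residuals from the additive fit, and the new residuals.
    """
    data = np.asarray(data, dtype=float)
    additive = common + row_eff[:, None] + col_eff[None, :]
    resid = data - additive                       # step 1: elementary residuals
    cv = np.outer(row_eff, col_eff) / common      # step 2: comparison values
    k = np.polyfit(cv.ravel(), resid.ravel(), 1)[0]   # step 3: slope k
    new_resid = resid - k * cv                    # step 4: new residuals
    return k, resid, new_resid

# Synthetic table with a known interaction (k = 0.5), for illustration only.
row_eff = np.array([-2.0, 0.0, 2.0])
col_eff = np.array([-1.0, 0.0, 1.0])
common = 10.0
data = (common + row_eff[:, None] + col_eff[None, :]
        + 0.5 * np.outer(row_eff, col_eff) / common)

k, resid, new_resid = extended_fit(data, common, row_eff, col_eff)
before = np.abs(resid).sum()      # step 5: sum of absolute residuals
after = np.abs(new_resid).sum()
print(k, before, after)
```

With real data the effects and common value would come from the elementary analysis, and the drop from `before` to `after` indicates how much the interaction term helps.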
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Lecture  8-3  Outline 
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Row  effects  plotted  for  raw 
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Section  V. 
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Average  net  interest  costs 
for  bonds 

Row  and  column  effects  plotted 
Homework, Unit 8

1.  Average interest rates by loan size and geographical region appear
    below.  (Units are % per year.)

                                     Size of loan
    Location               $1000   $7000   $10000   $30000   $100000

    New York City           10.0     9.0      8.5      8.3       7.8
    South and West          10.8    10.4      9.8      9.1       8.6
    North and East          10.9    10.3     10.2      9.1       8.6
      (excluding NYC)

    (a)  Median polish this table, and comment on residuals.

    (b)  Compare the three locations by examining the location
         effects.

    (c)  How do the interest rates vary with loan size?

    (d)  Suggest some reasons for marked effects across loan size and
         region; assume loan money is a commodity sold in a market.

    (e)  Assume that you are the manager of a large, non-profit
         rehabilitation organization in Manchester, New Hampshire.  The
         organization has decided to construct a new [illegible] for
         offices and recreational facilities.  A loan of $100,000 is
         required to finance construction.  Your board of directors
         suggests that many small loans be made in the Manchester area
         so that local financial institutions will benefit.  But you are
         concerned with the organization's growing indebtedness and
         suggest a different strategy to minimize cost.  What is the
         strategy?  What arguments would you use in support of your
         position?

2.  Pursuing a study of the equity of basing school support on local
    property assessments, you gather data on assessed values for single
    family dwellings by age of dwelling and metropolitan area.  The data
    appear below.

    (a)  Analyze the table.

    (b)  What substantive explanations can you provide for any
         consistencies in effects across age and region?

    (c)  If school support is based on assessed property values, what
         can you say about the distribution of the burden of school
         support across these metropolitan areas?
                                      City
    Age of dwelling    Atlanta    L.A.    D.C.    Chicago    New York

    < 5 years            222      238     229       224        243
    6-10 years           227      239     225       231        240
    11-20 years          222      221     224       212        249
    21-30 years          195      216     230        NA         NA
    31-40 years          199      214     213       198        192
    > 40 years           195      206     205       221        251

    Entries are in $.

3.  Consider the two two-way tables shown below.

    The first gives labor participation rates for women with
    children in 4 age classifications for 1950-70 in 5 year intervals.
    The second table gives labor participation rates for married
    women in 4 age classes for the same years.  Entries are % of women
    with the specified row/column characteristics that are employed;
    e.g., 11.9% of women with children under 6 were employed in 1955.

    (a)  Analyze these tables using median polish.

    (b)  How do children affect the labor participation rates of women,
         as evidenced by the columns of the first table?  Your
         supervisor is particularly interested in the "children 6-17"
         effect.  Why is this effect so much higher than the "no
         children under 18" effect?  Why doesn't the rate increase as
         children become older?

    (c)  In general, is the participation rate higher or lower for
         married women than for women with children?  Prove to your
         supervisor that this question is easily answered by examining
         only one fitted parameter from each table.

    (d)  Present to your supervisor the two relationships: between the
         6 years in the tables and the labor participation of women
         with children, and between the years and married women.  Is
         there a linear relationship in either table?  How do the
         fitted lines compare?

    (e)  Check to see if the row and column effects are additive in the
         raw unit of measurement.  Check the residuals for any sign
         patterns and the comparison values for evidence of need for a
         transformation.
LABOR FORCE PARTICIPATION RATES (in %)
MARRIED WOMEN (HUSBAND PRESENT)

                            YEARS
    AGE        1950    1955    1960    1965    1970

    20-24      28.5    29.4    30.0    35.6    47.4
    25-34      23.8    26.0    27.7    32.1    39.3
    35-44      28.5    33.7    36.2    40.6    47.2
    45-54      26.8    33.9    40.5    44.0    49.5

Source:  Department of Labor, Manpower Report of the President, 1973.


LABOR FORCE PARTICIPATION RATES (in %)
WOMEN WITH CHILDREN (HUSBAND PRESENT)

                                          YEARS
                               1950    1955    1960    1965    1970

    With Children Under 6      11.9    16.2    18.6    23.3    30.3
    Children 0-17              12.6    17.3    18.9    22.8    30.5
    Children 6-17 Only         28.3    34.7    39.0    42.7    49.2
    No Children Under 18       30.3    32.7    34.7    38.3    42.2

Source:  Department of Labor, Manpower Report of the President, 1973.
4.  This table shows average infant mortality rates over 1964-1966,
    whites and blacks, legitimate and illegitimate births, for 4
    regions of the United States.

    (a)  Analyze this table using both median polish and mean polish.
         How do the two fitted models compare?  If there is a
         difference in fits, explain why.

    (b)  Estimate the infant mortality rate for black illegitimate
         births in the western United States.

    (c)  Suppose you work for an agency in HEW and have a $20 million
         1977-78 appropriation for educating expectant mothers in pre-
         and postnatal care.  How should this money be spent?  Discuss
         how the funds should be allocated to regions of the United
         States and to whom the educational campaign should be
         directed: specifically, which age groups, which races, etc.
         The majority of your inferences should be based on this table.


AVERAGE INFANT MORTALITY RATES, 1964-1966
(average annual rates per 1000 live births)

                                     REGION OF U.S.
                          NORTHEAST   NORTH CENTRAL   SOUTH   WEST

    White Legitimate         19.1         21.7         21.7    20.0
    White Illegitimate       35.5         33.3         36.5    31.4
    Black Legitimate         33.9         44.0         40.4    35.5
    Black Illegitimate       43.6         39.9         45.1     NA

Source:  Socioeconomic Issues of Health, 1974.
Homework Solutions
Unit 8

1a)  Step 1.  Find row medians.

                                              med
            10.0    9.0    8.5    8.3    7.8 |   8.5
            10.8   10.4    9.8    9.1    8.6 |   9.8
            10.9   10.3   10.2    9.1    8.6 |  10.2

     Step 2.  Subtract out row medians; find column medians.

                                              part
             1.5     .5      0    -.2    -.7 |   8.5
             1.0     .6      0    -.7   -1.2 |   9.8
              .7     .1      0   -1.1   -1.6 |  10.2
     med     1.0     .5      0    -.7   -1.2 |   9.8

     Step 3.  Subtract out column medians and find new row medians,
              then new column medians.

                                              med    part
              .5      0      0     .5     .5 |    .5   -1.3
               0     .1      0      0      0 |     0      0
             -.3    -.4      0    -.4    -.4 |   -.4     .4
     med       0      0      0      0      0
     part    1.0     .5      0    -.7   -1.2   common  9.8

     Step 4.  Subtract out new row medians.  All medians now 0.

                                              med    part
               0    -.5    -.5      0      0 |     0    -.8
               0     .1      0      0      0 |     0      0
              .1      0     .4      0      0 |     0      0
     med       0      0      0      0      0
     part    1.0     .5      0    -.7   -1.2   common  9.8

     Resultant Table

            $1000  $7000  $10000  $30000  $100000 |  eff    fit
     NYC        0    -.5     -.5       0        0 |  -.8    9.0
     SW         0     .1       0       0        0 |    0    9.8
     NE        .1      0      .4       0        0 |    0    9.8
     eff      1.0     .5       0     -.7     -1.2   common  9.8
     fit     10.8   10.3     9.8     9.1      8.6
1a)  (continued)

     Residuals
     Unit = .1

       -1 |
       -0 | 55
       -0 | 00000
        0 | 0000011
        0 | 4
        1 |

     The residuals tend to be small (0) or large (.4, -.5), as we
     expect from a resistant procedure.  That two of the three large
     residuals are from NYC suggests further analysis of this location.
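
The hand steps above can be reproduced with a short median polish routine. The following is a sketch rather than the program used in the course: the function name and stopping rule are assumptions, and it sweeps rows first and then columns, as in the steps shown.

```python
import numpy as np

def median_polish(table, max_iter=10, tol=1e-9):
    """Fit data = common + row effect + column effect + residual by medians."""
    resid = np.array(table, dtype=float)
    n_rows, n_cols = resid.shape
    common = 0.0
    row_eff = np.zeros(n_rows)
    col_eff = np.zeros(n_cols)
    for _ in range(max_iter):
        # Sweep row medians out of the table and into the row effects.
        row_med = np.nanmedian(resid, axis=1)
        resid -= row_med[:, None]
        row_eff += row_med
        shift = np.median(col_eff)       # keep column effects centered
        col_eff -= shift
        common += shift
        # Sweep column medians out of the table and into the column effects.
        col_med = np.nanmedian(resid, axis=0)
        resid -= col_med[None, :]
        col_eff += col_med
        shift = np.median(row_eff)       # keep row effects centered
        row_eff -= shift
        common += shift
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break                        # all medians are now zero
    return common, row_eff, col_eff, resid

# Interest rates (% per year): rows NYC, SW, NE; columns by loan size.
rates = [[10.0,  9.0,  8.5, 8.3, 7.8],
         [10.8, 10.4,  9.8, 9.1, 8.6],
         [10.9, 10.3, 10.2, 9.1, 8.6]]

common, row_eff, col_eff, resid = median_polish(rates)
print(round(common, 1), np.round(row_eff, 1), np.round(col_eff, 1))
print(np.round(resid, 1))
```

Up to floating-point rounding this gives a common value of 9.8, location effects of -.8, 0, 0, and loan-size effects of 1.0, .5, 0, -.7, -1.2, in agreement with the resultant table.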


1b)  Examining the location effects, we immediately note that
     there is really only one effect: NYC.  The other two regions
     have zero location effects.  Further, the NYC effect is large
     (almost a full percent) and negative, i.e., interest rates in
     NYC tend to be almost a full percent lower than in the NE and SW
     regions (for the conditions under which the data were collected:
     time, term, loan size, etc.).

1c)  It is quite clear that interest rates decrease monotonically
     as the loan size increases (for the conditions under which the
     data were collected).

1d)  If we consider loan money as a commodity sold in a market,
     then much of the size and location effects might be explained by
     supply and demand.  It is likely that money is more available
     (larger supply) in NYC than elsewhere due to the high density of
     financial institutions there.  We do not, however, expect demand
     to be correspondingly higher in NYC since money consumers
     (individual or commercial) are at least as numerous in each of
     the other two regions.  NYC, a "financial capital", thus exhibits
     lower interest rates.

     Similarly, one would expect interest rates to decrease with
     loan size since (a) there is probably less demand for loans of
     $100,000 than of $1000, (b) the paperwork for any single loan is
     probably equivalent, so lender costs for one $100,000 loan would
     be significantly less than for a hundred $1000 loans, (c) there
     is probably less risk involved with the larger loans (would you
     loan $100,000 as readily as $1000?).

     Note that these ideas might also help explain the two
     large residuals for NYC.  Suppose large loans ($30,000 or
     $100,000) are available only from the larger banks while small
     loans ($1000) are available from all (but mostly the smaller)
     banks.  The lower interest rates for the $7000-$10000 loans
     might be caused by a relatively smaller demand for these loans
     offered by the large NYC banking establishments.  (A similar
     trend would not be expected for the $1000 loans since they may
     not be quite as readily available from those larger banks, and
     hence not experience quite the same degree of oversupply.)  We
     assume all other considerations (time, term, etc.) are equal.

1e)  Should you follow the advice of your board of directors,
     you would expect to pay over 10% interest on the loans (since
     interest rates for the NE for loans of $10000 or less are 10.2%
     or greater).  Should a single $100,000 loan be taken instead,
     the interest rate would be only 8.6%, a savings of at least
     1.6% or $1600.

     Of course, an even more clever strategy would be to go to
     NYC and take the loan there.  The resultant interest rate would
     be 7.8%.  This strategy would save at least 2.4%, or $2400,
     over that suggested by your board of directors.

     A reasonable compromise would be to suggest that members of
     your board take a pay cut to compensate the company for the
     larger cost of implementing their plan.

     Note that since the organization is non-profit and public
     service, a consideration such as generating good-will (which
     often induces corporations to pursue more costly strategies) is
     not an issue.  However, if by taking the more costly loans from
     other local banks, other benefits (such as fund raising aid
     from these institutions) accrue, a more complex cost-benefit
     analysis is required.
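
The savings quoted in the loan answer can be checked with a few lines of arithmetic. The sketch assumes simple interest for one year on the full $100,000 principal, which the solution does not state explicitly, and the function name is hypothetical.

```python
def annual_interest(principal, rate_percent):
    """Simple interest for one year (an assumed simplification)."""
    return principal * rate_percent / 100.0

PRINCIPAL = 100_000

board_plan = annual_interest(PRINCIPAL, 10.2)   # many small NE loans, at least 10.2%
single_ne = annual_interest(PRINCIPAL, 8.6)     # one $100,000 loan in the NE
single_nyc = annual_interest(PRINCIPAL, 7.8)    # one $100,000 loan in NYC

saving_ne = board_plan - single_ne      # about $1600 per year
saving_nyc = board_plan - single_nyc    # about $2400 per year
print(round(saving_ne), round(saving_nyc))
```

Because 10.2% is the lowest rate among the small loans, both figures are lower bounds on the savings.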
2a)  Columns are Atlanta, LA, DC, Chicago, New York; rows are age of
     dwelling from < 5 years to > 40 years.  "--" marks a missing value.

     Step 1
                                               med
            222    238    229    224    243 |   229
            227    239    225    231    240 |   231
            222    221    224    212    249 |   222
            195    216    230     --     -- |   216
            199    214    213    198    192 |   199
            195    206    205    221    251 |   206

     Step 2
                                               part
             -7      9      0     -5     14 |   229
             -4      8     -6      0      9 |   231
              0     -1      2    -10     27 |   222
            -21      0     14     --     -- |   216
              0     15     14     -1     -7 |   199
            -11      0     -1     15     45 |   206
     med     -6      4      1     -1     14 |   219 (common)

     Step 3
                                               med   part
             -1      5     -1     -4      0 |    -1     10
              2      4     -7      1     -5 |     1     12
              6     -5      1     -9     13 |     1      3
            -15     -4     13     --     -- |    -4     -3
              6      9     13      0    -21 |     6    -20
             -5     -4     -2     16     31 |    -2    -13
     part    -6      4      1     -1     14     219 common

     Step 4
                                               part
              0      6      0     -3      1 |     9
              1      3     -8      0     -6 |    13
              5     -6      0    -10     12 |     4
            -11      0     17     --     -- |    -7
              0      3      7     -6    -27 |   -14
             -3     -2      0     18     33 |   -15
     med      0      2      0     -3      1     219 common
     part    -6      4      1     -1     14

     Step 5
                                               med   part
              0      4      0      0      0 |     0      9
              1      1     -8      3     -7 |     1     13
              5     -8      0     -7     11 |     0      4
            -11     -2     17     --     -- |    -2     -7
              0      1      7     -3    -28 |     0    -14
             -3     -4      0     21     32 |     0    -15
     part    -6      6      1     -4     15     219

     Step 6
                                               part
              0      4      0      0      0 |     9
              0      0     -9      2     -7 |    14
              5     -8      0     -7     11 |     4
             -9      0     19     --     -- |    -9
              0      1      7     -3    -28 |   -14
             -3     -4      0     21     32 |   -15
     med      0      0      0      0      0
     part    -6      6      1     -4     15     219 common

     Step 7
                                               med   part
              0      4      0      0      0 |     0      9
              0      0     -9      2     -7 |     0     14
              5     -8      0     -7     11 |     0      4
             -9      0     19     --     -- |     0     -9
              0      1      7     -3    -28 |     0    -14
             -3     -4      0     21     32 |     0    -15
     med      0      0      0      0      0     219 common
     part    -6      6      1     -4     15
     The resultant table is

                  Atlanta    LA     DC   Chicago    NY |  eff   fit
     < 5 years        0       4      0       0       0 |    9   228
     6-10             0       0     -9       2      -7 |   14   233
     11-20            5      -8      0      -7      11 |    4   223
     21-30           -9       0     19      --      -- |   -9   210
     31-40            0       1      7      -3     -28 |  -14   205
     > 40            -3      -4      0      21      32 |  -15   204
     eff             -6       6      1      -4      15   common = 219
     fit            213     225    220     215     234

2b)  Note the larger NY effect, and the large NY residuals.
     Note that, except for < 5 yrs., the age effect decreases as age
     increases.

     One might expect assessments to correspond to a large degree
     to the cost of living of a given area.  This certainly appears to
     be the case, with the greatest location effect for NYC (which
     has the highest cost of living of those cities examined) and the
     smallest location effect (greatest negative) for Atlanta.
     Similarly for LA, DC, and Chicago.

     Similarly, we expect assessment to decrease with dwelling
     age.  This certainly appears to be the case, except for dwellings
     constructed in the past 5 years.  Since these are aggregate
     figures, this phenomenon might be explained by a recent wave of
     low-cost housing construction in the larger cities.
2c)  If school support is based on assessed property values, we
     expect the burden of this support to fall most heavily upon
     newcomers to the city, i.e., those who move into the newer
     (< 10 yrs. old) dwellings.  This assumes a minimal mobility on
     the part of longtime residents.  In an area of higher internal
     mobility, the burden of support falls more heavily on those who
     think they can afford to move into newer (< 10 yrs. old)
     dwellings.

     The impact of such an assessment structure may be to create
     a disincentive to the construction of new homes, with a
     corresponding loss of jobs, and a disincentive to mobility by
     young upwardly mobile families.  This would yield reduced total
     tax revenues and make difficult the support of schooling in
     general.  Such effects would be particularly hard felt in NYC,
     which could probably use such potential tax revenues most.  Thus,
     basing school support on property taxing is not likely to yield
     equitable education, at least on the basis of these data.


3a)  Table I is analyzed by median polish in Figures A through D,
     Table II in Figures E through H.  In each case, the analysis
     proceeded as follows:

       i)  The data were entered
      ii)  The tables were polished (Figures A and E)
     iii)  The residuals were analyzed (Figures B and F)
      iv)  A diagnostic plot was made (Figures C and G)
       v)  A resistant line was calculated for the diagnostic plot
           (Figures D and H)

     By themselves these Figures do not constitute a complete analysis;
     we must interpret these Figures.  This is done in parts (b), (c),
     (d), (e), and (f) below.

     We should, however, be very cautious in our interpretation of
     these data since we do not know how participation rates were
     calculated, what the base group was (i.e., these are percents of
     what group?), what the eligibility requirements for this base
     group were, and whether in fact all of these considerations were
     even consistent for all of the years in question.  Manipulating
     these factors (or just changing definitions from year to year)
     can create a table of "participation rates" which reflect
     anything we wish.

     For this problem, however, we will assume that the above points
     have already been addressed and answered to our satisfaction.

b)   Predictably, the presence of young children (< 6 years old)
     seems to lower the labor participation rate of women, as evidenced
     by the large negative effect of the first two rows of Table II
     (compared to the large positive effect for the fourth row,
     women with no children under 18).  The presence of children in
     general also seems to lower the average rate (see (c) below).

     More interesting, however, is that the rate for women with OLDER
     children (6 to 17 years of age) is HIGHER than that for women
     with no children under 18.  We can hypothesize at least several
     reasons why this might be so (although this is an excellent
     question for further study):

     --families with older children are more likely to need the
       additional income;

     --women with children over 18 are likely to be older, and hence
       possess fewer, obsolescent, or just "rusty" skills (but see
       the effects of age in Table I);

     --many women whose children are over 18 (note that this is an
       open ended age bracket; the children could be 37) may have
       reached voluntary, or mandatory, retirement age.

     Note also that these effects are NOT constant over time.  In 1950,
     the rate for women with children between 6 and 17 was lower than
     that for women with children over 18.  Yet thereafter the
     situation is reversed.  This situation might be due to the large
     utilization of women (especially women without children at home)
     in the work force during World War II (whose effects would
     continue for several years, perhaps even through 1950), and
     perhaps also Korea (1950-1952).

     A similar, although temporary, reversal occurs between the rates
     for women with children under 6 and with children under 17 in 1965.
     Reasons for this situation are more difficult to propose.


c)   Comparing the common values for each table as general indications
     of OVERALL level, we note that the common value for Table I (33.7)
     is greater than that for Table II (27.0), which suggests that the
     participation rate for married women in general is greater (by
     about 7%) than that for women with children.  (This should not be
     too surprising.)

d)   First note that there are only FIVE years given in each table.

     The plots are shown for Table I in Figure I and for Table II in
     Figure K.  The resistant line for each is shown in Figures J and
     L respectively.

     Both plots are nearly linear, the third and fourth point of each
     lying somewhat below the fitted line.

     Note the similarity in slope between the two fitted lines
     (.896 vs. .855).  The two lines therefore differ only by a
     constant of about 8%.  (This is calculated by comparing the
     ordinates at each of several years.  We cannot simply compare the
     constant terms of the two resistant lines since the slopes are not
     precisely equal.)

     Note how this corresponds to our answer in (c) above.  The
     participation rate for married women seems to have been
     consistently (over time) about 8% higher than that for women with
     children.

e)   To check additivity, we examine three indicators:

       i)  the residual sign patterns (Figures A and E)
      ii)  the residual behavior (Figures B and F)
     iii)  the diagnostic plots (Figures C, D, and G, H)

     A residual sign pattern for Table I (Figure A) does not seem
     particularly prominent.  The poor behavior of the residuals
     (Figure B) points to a definite lack of additivity.  The
     diagnostic plot (Figure C), whose slope (as calculated in Figure D)
     is 1.46, confirms this.  Reexpression will be pursued in part (f).

     Similarly, there is no residual sign pattern for Table II
     (Figure E).  The residuals (Figure F), while not especially well
     behaved, have few outliers.  The slope of the resistant line
     (calculated in Figure H) for the diagnostic plot (Figure G) is
     extremely close to 0, a decisive indication of additivity.

f)   We noted in (e) above that the slope of the resistant line for the
     diagnostic plot for Table I (Figures C, D) was 1.46, confirming
     the other indications of nonadditivity.  This value (approximately
     1.5) suggests reexpression to reciprocal roots.  Since reciprocals
     (let alone reciprocal roots) are difficult to interpret, a log
     reexpression was tried first.

     Figures M through P show the analysis of the log data.  The
     behavior of the residuals, and the slope of the resistant line for
     the diagnostic plot, both suggest the inadequacy of this
     transformation.

     The (negative) inverse reexpression (still somewhat easier to
     interpret than inverse roots) was tried next.  Figures Q through
     T show this analysis.  The residuals are much better behaved,
     although the slope of the resistant line for the diagnostic plot
     suggests (predictably enough) a further reexpression by square
     roots.

     An analysis of the (negative) inverse roots might therefore be
     done next, if the increased additivity is deemed worth the
     corresponding increase in difficulty of interpretation.
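
The reasoning from diagnostic slope to reexpression can be sketched as below. The set of rungs on the ladder, the use of base-10 logs, and the function names are assumptions made for illustration; the sign flip for negative powers mirrors the "(negative) inverse" mentioned above, so that the reexpressed values keep the original order.

```python
import math

# Assumed rungs of the ladder of powers; 0.0 stands for the log.
LADDER = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

def suggested_power(slope):
    """Rough guide: power = 1 - slope of the diagnostic plot."""
    return 1.0 - slope

def nearest_rung(power, ladder=LADDER):
    """Closest conventional power on the assumed ladder."""
    return min(ladder, key=lambda p: abs(p - power))

def reexpress(x, power):
    """Re-express one positive value by the given power."""
    if x <= 0:
        raise ValueError("this sketch only handles positive data")
    if power == 0:
        return math.log10(x)
    y = x ** power
    return y if power > 0 else -y     # negate so larger x stays larger

# Slope 1.46 (Table I) points to the reciprocal root.
print(nearest_rung(suggested_power(1.46)))
# A participation rate of 25.0 re-expressed as a negative reciprocal root.
print(reexpress(25.0, -0.5))
```

The same helper applied to a slope of -.39 gives a power near 1.39, closest to the 3/2 rung, though a more interpretable neighbor such as the square may be preferred in practice.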
4a)  The analysis using median polish is shown in Figures U through X;
     the hand calculations are shown in Figure Y for comparison.  (Note
     that hand calculations were done using mortality rates per 10,000
     live births.  Differences between computer and hand calculations,
     aside from the units difference, are due to rounding.)

     The mean polish is shown in Figure Z.  Residual and diagnostic
     plots were not done for the mean polish.  See also (d) below.

     Comparing the fitted models from the mean and median polish
     (Figure Y vs. Z) shows the two to be similar, although there is
     no reason to expect the two to be the same.  Indeed, just as we
     expect the mean and median to be the same only in very
     well-behaved batches, we expect a difference between the results
     of mean and median polish in real (and hence not likely to be
     well-behaved) data.

     When performing the analysis on these data, we shall use the
     median polish, since it has the desirable quality of being
     resistant.

b)   This is estimated by:  common + west effect + B-I effect

     From Figure U (median polish):  36.076 + 6.705 + (-2.576) = 40.2

     From Figure Y (hand calculated median polish):  36.2 + 7.1 +
     (-2.9) = 40.4

     From Figure Z (hand calculated mean polish):  34.0 + 8.3 +
     (-2.3) = 40.0

     (Note the similarity among the three.)
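
The missing-cell estimate is just the additive fit evaluated at that cell. The sketch below repeats the three sums using the values quoted in the solution; the function name and dictionary keys are illustrative only.

```python
def fitted_value(common, first_effect, second_effect):
    """Additive fit for one cell: common + row effect + column effect."""
    return common + first_effect + second_effect

# Values as quoted in the solution for the black illegitimate / West cell.
estimates = {
    "median polish (Figure U)": fitted_value(36.076, 6.705, -2.576),
    "hand median polish (Figure Y)": fitted_value(36.2, 7.1, -2.9),
    "hand mean polish (Figure Z)": fitted_value(34.0, 8.3, -2.3),
}

for name, value in estimates.items():
    print(name, round(value, 1))
```

All three estimates fall between 40.0 and 40.4 deaths per 1000 live births.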


c)   One equitable distribution of funds to each geographical region
     would be in proportion to the need, i.e., in proportion to the
     column fits.  Hence, we would allocate (from Figure D):

     35.8 / (35.8 + 36.3 + 38.4 + 33.5) = 24.9% of the total to the NE

     36.3 / (35.8 + 36.3 + 38.4 + 33.5) = 25.2% of the total to the NC

     38.4 / (35.8 + 36.3 + 38.4 + 33.5) = 26.7% of the total to the south

     33.5 / (35.8 + 36.3 + 38.4 + 33.5) = 23.2% of the total to the west
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The  educational  campaign  la  each  region  might  then  be  directed 
to  each  of  the  foar  group*  in  proportion. to  the  calculated  fits 
(or  actual  observed  values)  for  that  region.     (NOT  the  row  fits, 
which  "average''  overall  regionsji 

The  above  method  however  only  responds  to  the  data  presented  to 
us  In  Table  III.    A  far  better— although  long  terns— solution  _ 
would  be  to  determine  the  (probably  common)  underlying  causes  of 
infant  taortaltty  and  allocate  the  money  to  a  -entralized  facility 
(for  medical  researdh  or  training  of  medical  personnel  for 
vxmpU)  ..  to  Regions  in  proportion  to  iieed  (for  more  maternity 
v:.rc  bsi^P,  or  simply  more  ambulances),  or  perhaps  even  to  nation- 
al and  segionai  mass  media  for  educational  broadcasting,    in  any 
case,  UEDERSfMi)  the  problem  before  pouring  money  into  it.  These 
data    do  :  NGT  provide  all  the  required  information  for  UNDER- 
STANDING the  pS6blem.    We  doh^t  even  know  if  the  observed  pat- 
terns are  consistent  over  time. 


In part (a) we analyzed the raw data by median and mean polish.
The slope of the resistant line (Figure X) of the diagnostic plot
(Figure W) suggests reexpression.  Although the value of the
slope (-.39, or about -1/2) suggests reexpression by the 3/2
power, a more easily interpreted reexpression is to square the
data (2 power).  An analysis of the squared data is shown (by
median polish) in Figures AA through DD.  Note the slope of the
resistant line (Figure DD) of the diagnostic plot for the
reexpressed data.  We might consider using the fits from THIS
analysis (Figure AA) in part (c) above.
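
The rule used above (suggested power = 1 - slope of the diagnostic plot) can be sketched as follows. The particular ladder of powers in half steps, with 0 standing for the logarithm, and both function names are assumptions made for illustration.

```python
# Assumed ladder of powers in half steps; 0.0 stands for the logarithm.
LADDER = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]


def suggested_power(slope):
    """Power suggested by a diagnostic plot: 1 minus the slope of its resistant line."""
    return 1.0 - slope


def nearest_ladder_power(power, ladder=LADDER):
    """Snap a suggested power to the closest rung of the ladder."""
    return min(ladder, key=lambda rung: abs(rung - power))


slope = -0.3885  # slope of the fitted resistant line reported in the printout
power = suggested_power(slope)
print(f"suggested power: {power:.2f}, nearest rung: {nearest_ladder_power(power)}")
```

The sketch lands on the 3/2 power; choosing the square instead, as the solution does, is a judgment in favor of interpretability rather than something the rule itself produces.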
TWO WAY TABLE OF DATA IN VARIABLE:  INFMORT   FIGURE U
AFTER 1 STEPS OF POLISH THE FITTED RESISTANT LINE IS:
RES3 = 0.0526 + -0.3885 * CVS3
TWO WAY TABLE OF DATA IN VARIABLE:  INFMORS   FIGURE

                 1            2            3            4

    1:      364.8098     470.8894     470.8894     400.0000

    2:     1260.2493    1108.8892    1332.2493     985.9600

    3:     1149.2090    1935.9985    1632.1597    1260.2493

    4:     1900.9587    1592.0078    2034.0081         N.A.

ELEMENTARY ANALYSIS BY MEDIAN POLISH:
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-1297.5637 


1  MISSING  VALUES  IN  SAVED  RESIDUALS. 
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PLOT  RES5  VS  CVS5 
Table 1 appeared in a recent issue of The Retired Officers' Journal.
It presents the monthly pay received by members of the US armed forces by
pay grade (job levels 4-10) and years of service (12-26) effective October,
1976.

Assume that you are a staff member of a congressional committee
which is considering the unionization of the armed forces.  Your
supervisor wants to contrast pay in the military with pay received by
professionals in unionized situations (such as at some universities).  But
first she wants to understand the table and has asked you to analyze it.

The analysis has been done for you by computer.  Parts of the
analysis and questions about these parts follow.


1.a   Is it true that an individual in the armed forces gets a pay raise
every year?  Explain your answer by reference to Table 1.

1.b   What is the monthly pay for someone in pay grade 7 who has been in the
service for 16 years?


Table 2 shows the pay data in median polished, "bordered table" form.

2.    Based only on Table 2 and the stem-and-leaf display of the residuals
from the fit in Figure 1, argue that to determine the monthly pay of
an individual in the armed forces one needs more information than
pay grade and years of service of the individual.  Assume that the
individual under consideration is in pay grade 4-10 and has been in
the service either 12, 14, 16, 18, 20, 22, or 26 years.


Figures 2, 3, 4 show diagnostic plots of the untransformed data and
two transformations, base 10 logarithms and square root.

3.  How are "comparison values" defined?  In simple layman's
non-quantitative language, tell your supervisor (and us, of course) the
purpose of the diagnostic plot and why a log or square root
transformation might be required.

4.  What is the preferable mode of analysis for this table, a
transformation of the data, or an extended fit?  Give the equation of the
extended fit for these data.
Lastly consider the plots of the effects versus respective variable
in Figure 5.

5.    Construct a simple equation, a function of pay grade and years in
service, that approximates the monthly military pay of an individual.
What is the yearly pay of an individual, grade 9, with 24 years of
service?


TABLE 1.--MONTHLY MILITARY BASIC PAY.
ENTRIES ARE IN $

                            Years of Service

Pay Grade      12     14     16     18     20     22     26
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3047 
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6 

1703 

1761 

2040 
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1820 
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2051 
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loeo 


TABLE 2.--MEDIAN POLISH OF PAY TABLE
FIGURE 1.--RESIDUALS FROM MEDIAN POLISH FIT.

unit = 10
2 = "hard" zero.


[FIGURE 2.  Diagnostic plot of the untransformed data:  residuals vs. comparison values.]

ROBUST EQUATION IS:  Y = 0.764703X + 5.25783



[FIGURE 3.  Diagnostic plot of the base 10 logarithms:  residuals vs. comparison values.]

ROBUST EQUATION IS:  Y = -1.38231X + 0.000554



FIGURE 4

[Diagnostic plot of the square roots:  residuals vs. comparison values.]

ROBUST EQUATION IS:  Y = 0.354679X - 0.106905



Unit 8

Quiz Solutions


1a.   Although the general tendency is for pay to increase with years of
service, pay does not increase for every 2 years of service within
a given pay grade.  For example, in grade 4, there is no increase
after 18 years of service.

b.    Monthly pay for someone in pay grade 7 with 16 years of service is
$2,678.


2.      First of all, without a table illustrating the residuals the exact
monthly pay can only be estimated.  Second, the stem-and-leaf of
the residuals shows that they are not a well-behaved batch due to
the presence of outliers.  This would indicate that the additive
model is "missing part of the action".  Third, the coded residuals
also show that the largest and smallest residuals are along the
edges of the table, indicating the need for a transformation or an
extended fit.

All of these clues indicate that the effects and residuals of the
linear model shown in Table 2 are inadequate for summarizing the
data and a transformation or extended fit should be tried.

3a.    Comparison Values = (Row Effect)(Col Effect) / Common term

b.    A diagnostic plot, which has comparison values on the
horizontal axis and the residuals from a median polish on the
vertical axis, is an indication of the adequacy of the additive
model.  If the plot indicates a linear relationship between the
residuals and the comparison values, then the additive model is
inadequate.  We should try a transformation of the original data
or an extension of the additive model, via multiplicative
interaction terms.  The slope of a linear relationship found in the
diagnostic plot should be subtracted from 1 to determine a
transformation that might be appropriate.  In this particular case,
the plot has a slope of .76.  Subtracting this from 1.0 gives us
.24 (about 1/4).  We want to keep the transformations simple so
we try a square root instead of a quarter root (moving slightly
up the ladder of powers) or a logarithm (moving slightly down on
the ladder of powers).

4a.    An extended fit of the data is the preferable mode of analysis
for this table.

-  It makes sense that there would be an interaction between level
or grade achieved and the number of years spent in the service.

-  The diagnostic plot indicates a quarter root and not one of the
simpler transformations we like to work with.

-  Transformations using square roots and logarithms didn't
completely work.

-  The coded residuals show that the largest and smallest residuals
are at the borders and corners of the table.

To make sure the extended fit was the best, compare the sum of
|Residuals| from the extended fit and the two transformations for the
smallest sum (assuming that the residuals were all placed in the same units).

The extended fit model is

    Data = Common + Row Effect + Col Effect

           + k (Row Effect x Col Effect) / Common term

where k is the slope of the diagnostic plot (.76):

    Data = 2689 + RE + CE + (.76/2689)(RE x CE)
         = 2689 + RE + CE + .00028 RE x CE

5.  Data = 2689 + [-740 + 40/year] + [-2750 + 400/pg]
         = 2689 - 740 - 2750 + 40/year + 400/pg
         = -801 + 40/year + 400/pg

    For Grade = 9, Years = 24:

    Monthly Pay = -801 + 40(24) + 400(9)
                = -801 + 960 + 3600
                = 3759

    Yearly pay  = 3759 x 12 = $45,108
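
The arithmetic of the approximating equation can be reproduced with a minimal sketch. The function names are illustrative assumptions; the coefficients are the ones in the solution (-801, 40 per year of service, 400 per pay grade).

```python
def monthly_pay(grade, years):
    """Approximate monthly pay from the simple linear equation in the solution."""
    return -801 + 40 * years + 400 * grade


def yearly_pay(grade, years):
    """Twelve months of the approximate monthly pay."""
    return 12 * monthly_pay(grade, years)


# Coefficient of the multiplicative term in the extended fit: slope / common term.
interaction_coefficient = 0.76 / 2689

print(monthly_pay(grade=9, years=24))
print(yearly_pay(grade=9, years=24))
print(round(interaction_coefficient, 5))
```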
Lecture     Reading

9-0         Tanur, Pages 52-65
            Bickel, Hammel, O'Connell article in
            Fairley and Mosteller, pages 113-30

9-1         Mueller, et al., Pages 480-8
            Fienberg, Chapters 1 and 2

9-2         Mueller, et al., Pages 489-500
            Fienberg, Chapter 3

9-3         Fienberg, Chapter 4
            Fienberg, Chapter 5

In addition, please read any articles in Fairley and Mosteller that
you have not already read.

Texts:

Fairley, W. and F. Mosteller, Statistics and Public Policy, Reading,
Mass.:  Addison-Wesley, 1977.

Fienberg, S., The Analysis of Cross-Classified Categorical Data,
M.I.T. Press, in press.

Mueller, J., et al., Statistical Reasoning in Sociology, Third
Edition.

Tanur, J., et al., eds., Statistics:  A Guide to the Unknown, San
Francisco:  Holden-Day, 1972.
Lecture 9-0.  Introduction to Unit 9
Introduction to Unit 9, Discrete Multivariate Analysis

Lecture Content:

1.  Discrete vs. Continuous Multivariate Data

2.  Multinomial Distribution for Contingency Tables

3.  Examples

Main Topics:

1.  Discrete  Multivariate  Data 

2.  Multinomial  Distribution 

3.  Examples  of  Contingency  Tables 
Topic 1.    Discrete Multivariate Data

I.    Basic Issue:    New "type" of data

1.  Everything we have discussed thus far, both response and
carrier variables, has been continuous

2.  This implies that within a specific range, the dependent
variable could take on any possible value

3.  For this unit, we change this assumption


II.    Problem:    How do we structure "discrete" data?

1.  We now assume that we have a set of variables that take
on only a finite number of discrete values

a.  Moreover, within this set we have no "independent/dependent"
dichotomy

b.  Example:    Alive/Dead variable; only two values or
categories

2.  We take all our variables and look at all combinations of
the categories

a.  We examine all possible intersections

b.  Each intersection is called a cell

3.  We then take a sample (perhaps exhaustive) from a
population, sample size N, and record the number of
observations falling within each cell

4.  Number of observations in each cell is called the
frequency count of the cell

III.    Solution:    Data structure is a Contingency Table

1.  The set of all cells and the frequencies of the cells
is called a contingency table

2.  The set of all frequencies is known as a Discrete
Multivariate Data Set

a.  The number of variables, n, is the dimensionality
of the contingency table

b.  n may be 1, 2, 3, etc.
IV.    Methods:    How do we analyze a contingency table?

1.  Generally researchers have calculated a $X^2$ statistic for
the table and stated whether the statistic was greater than
the tabulated 5% $\chi^2$ value, and then called it quits

2.  No one really knew what to do with a table of dimension
3 or more; could only handle 1 or 2 dimensional tables

3.  Lately, we have begun to understand higher dimensional
tables and have developed a sophisticated new technology,
the log-linear model, for the analysis
Topic 2.    Multinomial Distribution

I.  Basic Issue:    Probability model for a Contingency Table

1.  We have k cells

2.  P{observation lands in the ith cell} = $p_i$; i ranges over
all cells

3.  We take a sample Y of size N

a.  $Y = \{y_j\},\ j = 1, \ldots, N$

b.  $y_j$ = appropriate cell for jth observation

4.  Let $x_i$ = number of observations falling in the ith cell

II.  Solution:    Multinomial Distribution

1.  $P\{x_1, \ldots, x_k\} = \dfrac{N!}{\prod_{i=1}^{k} x_i!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$

where $\sum_i p_i = 1$, $\sum_i x_i = N$

2.    So the probability distribution for a contingency table
with k cells is the multinomial

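
As an illustration of the multinomial model for cell counts, the sketch below evaluates the probability of one set of counts. The function name `multinomial_pmf` and the example probabilities are assumptions chosen for illustration, not values from the lecture.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P{x_1, ..., x_k} = N! / (x_1! ... x_k!) * p_1^x_1 ... p_k^x_k."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)  # exact integer division at every step
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# Hypothetical 3-cell table: N = 3 observations, cell probabilities summing to 1.
cell_probs = [0.2, 0.3, 0.5]
print(multinomial_pmf([1, 1, 1], cell_probs))

# The probabilities of all possible tables with N = 3 should add up to 1.
total = sum(
    multinomial_pmf([a, b, 3 - a - b], cell_probs)
    for a in range(4)
    for b in range(4 - a)
)
print(total)
```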
Topic 3.    Examples of Tables

I.    1 dimensional table:  simple multinomial                      (2)

1.  Test fit to a known distribution (uniform)

2.  $X^2$ goodness of fit test

II.    1 dimensional table:  simple multinomial.  Another goodness of
fit test, but what distribution?                                   (3)

III.    2 dimensional table:  2x3

1.  Test for independence between the variables

2.  $X^2$ test for independence

IV.    3 and 4 dimensional tables                                   (4)

1.  What do we do?

2.  Independence between which variables?

3.  There are many different models to consider
Lecture 9-1.    Simple Multinomials
Simple Multinomials:  Testing for Goodness of Fit

Lecture Content:

1.  Determination of appropriate probability models

2.  Pearson's $X^2$ test for goodness of fit

3.  Discrete probability models

4.  Continuous probability models

Main Topics:

1.  Making direct inferences about distributions

2.  Specific probability models to fit
Topic 1.    Making direct inferences about distributions

I.    Basic Issue:    Does population distribution have a specific form?

1.  Examine empirical (sample) distribution

a.  Group data into a set of qualitative classes, C of
them

b.  Compute expected frequencies for each cell with
specific hypothesized distributions

2.  Test the "goodness" of various theoretical distributions  (1)
for the data

II.    Solution:    Goodness of Fit Test

1.  Have some null hypothesized expected frequencies $\{E_i\}$

2.  Data give you observed frequencies $\{O_i\}$

3.  Compute

$X^2 = \sum_{i=1}^{C} \dfrac{(O_i - E_i)^2}{E_i}$

4.  $X^2$:  weight the squared difference of $O_i$ and $E_i$ inversely  (2)
by $E_i$; cells with large departures get more weight
if $E_i$ is small

5.    The quantity $X^2$ is called Pearson's Chi-Square Statistic


III.    Method:    How do we determine whether to reject $H_0$?

1.  $X^2$, for large N, is distributed as a $\chi^2$ random variable  (3)
with C-1 degrees of freedom, when $H_0$ is true

2.  We lose 1 d.f. since N is fixed
3.  If $X^2 > \chi^2_{C-1;\,\alpha}$, reject $H_0$

a.    Probability that the sample data accord with $H_0$ is
quite small

b.    Doubtful that this observed $X^2$ could have originated by
chance

4.  When can we use this inferential procedure?

a.  Each and every sample observation falls into one and
only one category or class interval

b.  The outcomes for the N observations in the sample
are independent

c.  Sample size N must be large

i.     If C-1 = 1, $E_i > 10$

ii.     If C-1 > 1, $E_i > 5$
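
The goodness of fit procedure outlined in this topic can be sketched in a few lines. The data are the animal bite counts from homework problem 1 of this unit, tested against a uniform distribution; the function names are illustrative assumptions, and the critical value 9.21 is the tabulated 1% value for 2 degrees of freedom quoted in the homework solution.

```python
def pearson_chi_square(observed, expected):
    """Pearson's X^2: squared departures weighted inversely by E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_cells_large_enough(expected):
    """Rule of thumb from the outline: E_i > 10 when C-1 = 1, otherwise E_i > 5."""
    threshold = 10 if len(expected) - 1 == 1 else 5
    return all(e > threshold for e in expected)


observed = [268, 189, 199]                   # bites reported in weeks 1, 2, 3
n = sum(observed)
expected = [n / len(observed)] * len(observed)  # uniform null: identical weeks

x2 = pearson_chi_square(observed, expected)
df = len(observed) - 1                       # we lose 1 d.f. since N is fixed
critical_01 = 9.21                           # tabulated value, 2 d.f., 1% level

print(f"X^2 = {x2:.2f} on {df} d.f.; reject H0 at 1%: {x2 > critical_01}")
```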
Topic 2.    Specific Probability Models to fit

I.    Discrete Models

1.  Binomial (n, p)

a.    n or fewer cells

b.    If we estimate p, we lose an additional 1 df

2.  Poisson ($\lambda$)

a.  ? cells

b.  If we estimate $\lambda$, we lose an additional 1 df

II.    Gaussian probability model

1.  Must take infinite range and break it up into a finite
number of cells

2.  Postulate Gaussianity

a.  Convert every observation into a standard score

b.  Lose 1 df for each parameter ($\mu$, $\sigma$) that we must estimate

3.  How many cells?    Suppose we desire C.

a.  Make intervals of equal width:

Min,  Min + (Max - Min)/C,  Min + 2(Max - Min)/C,  ...

b.  Or make each interval such that probability of an
interval is 1/C

4.  Fixed probability intervals are preferred to fixed
width intervals.
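
The two ways of cutting the Gaussian range into C cells can be compared with a small sketch. It uses `statistics.NormalDist` from the Python standard library; the function names and the example values (a standard Gaussian, C = 4) are assumptions for illustration.

```python
from statistics import NormalDist


def equal_width_cutpoints(minimum, maximum, cells):
    """Cell boundaries Min, Min + (Max-Min)/C, Min + 2(Max-Min)/C, ..., Max."""
    width = (maximum - minimum) / cells
    return [minimum + i * width for i in range(cells + 1)]


def equal_probability_cutpoints(mean, sd, cells):
    """Interior boundaries giving each cell probability 1/C under a Gaussian."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf(i / cells) for i in range(1, cells)]


# Standard scores, C = 4: the interior boundaries are the quartiles.
cuts = equal_probability_cutpoints(0.0, 1.0, 4)
print([round(c, 4) for c in cuts])

# Equal-width alternative over an observed range of 0 to 10 with C = 5.
print(equal_width_cutpoints(0.0, 10.0, 5))
```

With equal probability cells every expected frequency is N/C, which makes the rule on minimum expected frequencies easy to check.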
Lecture 9-2.    2x2 Contingency Tables
2x2 Contingency Tables:    Examining Interactions

Lecture Content:

1.    2x2 array of counts
2.    Measuring association
3.    Log-linear models
4.    Testing for independence

Main Topics:

1.    Cross-Product Ratio for 2x2 tables

2.    Log-linear model and presence of interaction

(There  are  no  transparencies  for  this  lecture) 
Topic 1.    Cross-Product Ratio for 2x2 tables

I.    Basic Issue:    Structure of Data

1.  Variables A and B, both at levels 1 and 2

2.  Two dimensional array of counts

                       B
                   1        2
        A    1    x_11     x_12     x_1+
             2    x_21     x_22     x_2+
                  x_+1     x_+2     N

3.  $x_{ij}$ are positive integers

4.  We can convert the $x_{ij}$ into probabilities:  $p_{ij} = x_{ij}/N$

5.  If we know N, $x_{1+}$, and $x_{+1}$, specifying any cell in the
table allows us to fill in the other 3 cells

6.  Hence, table itself has only 1 degree of freedom
after specifying N, row margin, and column margin

II.    Problem:    How do data exhibit interaction?

1.  If variables A and B are independent, then $x_{ij}/N = (x_{i+}/N)(x_{+j}/N)$,
the product of the marginal distributions

2.  As variables A and B exhibit more and more interaction,
$x_{ij}$ differs more and more from $\frac{1}{N} x_{i+} x_{+j}$

3.  How do we best measure the interaction present between
A and B?


III.    Solution:    Cross-Product Ratio

1.    Natural "measure of association"

$\alpha = \dfrac{x_{11}\, x_{22}}{x_{12}\, x_{21}}$

2.    Properties of $\alpha$

a.    If A and B are independent, $\alpha = 1$

b.    $\alpha$ is invariant under the simultaneous interchange of
rows and columns

c.    $\alpha$ is invariant under row and column multiplications
(not true for $X^2$)

3.    $\alpha$ is also called the odds ratio

$\alpha = \dfrac{p_{11}/p_{12}}{p_{21}/p_{22}}$

a.  $p_{11}/p_{12}$ = the odds on being in the first level of B,
given that you are in the first level of A

b.  $p_{21}/p_{22}$ = the odds on being in the first level of B,
given that you are in the second level of A
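
The properties of the cross-product ratio listed above can be verified numerically. This is a minimal sketch; the function name and the example counts are hypothetical and chosen only for illustration.

```python
from math import isclose


def cross_product_ratio(table):
    """alpha = (x_11 * x_22) / (x_12 * x_21) for a 2x2 table of counts."""
    (x11, x12), (x21, x22) = table
    return (x11 * x22) / (x12 * x21)


# Hypothetical 2x2 table of counts.
counts = [[12, 5], [7, 9]]
alpha = cross_product_ratio(counts)

# Property (a): a table built as a product of its margins has alpha = 1.
independent = [[10, 20], [30, 60]]

# Property (b): interchange rows and columns simultaneously.
swapped = [[9, 7], [5, 12]]

# Property (c): multiply the first row by 3.
row_scaled = [[36, 15], [7, 9]]

print(alpha, cross_product_ratio(independent))
print(cross_product_ratio(swapped), cross_product_ratio(row_scaled))
```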
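Topic 2.    Log-Linear Model and presence of interaction

I.    Basic Issue:    Null model for data structure

1.  If A and B are independent

$\log p_{ij} = u + u_{1(i)} + u_{2(j)}$;  i = 1, 2;  j = 1, 2

a.    $u = \frac{1}{4} \sum_{i,j} \log p_{ij}$

b.    $u + u_{1(i)} = \frac{1}{2} (\log p_{i1} + \log p_{i2})$

c.    $u + u_{2(j)} = \frac{1}{2} (\log p_{1j} + \log p_{2j})$

2.  If A and B exhibit interaction, then

$\log p_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(ij)}$

$u_{12(11)} = -u_{12(12)} = -u_{12(21)} = u_{12(22)}$

$u_{12(ij)}$ are the interaction terms

II.    Problem:  How do we estimate the parameters, and determine
whether $u_{12(ij)} = 0$?

1.  Let $l_{ij} = \log p_{ij}$

2.  Then

a.    $u = \frac{1}{4} \sum_{i,j} l_{ij} = \frac{1}{4} l_{++}$

b.    $u_{1(i)} = \frac{1}{2} l_{i+} - \frac{1}{4} l_{++}$,   $u_{2(j)} = \frac{1}{2} l_{+j} - \frac{1}{4} l_{++}$

c.    $u_{12(ij)} = l_{ij} - \frac{1}{2} l_{i+} - \frac{1}{2} l_{+j} + \frac{1}{4} l_{++}$

III.    Solution:    Testing for Independence, or whether $u_{12(ij)} = 0$
for all i, j.

1.    $X^2 = \sum_{i,j} \dfrac{(x_{ij} - E(x_{ij}))^2}{E(x_{ij})}$

where $E(x_{ij}) = N p_{i+} p_{+j}$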
2.    Definition:    Goodman's Measure (likelihood ratio statistic)

$G^2 = 2 \sum_{i,j} x_{ij} \log \dfrac{x_{ij}}{E(x_{ij})}$

3.    Both distributed under $H_0$:  $u_{12(ij)} = 0$
as $\chi^2$ random variables, 1 df


Lecture 9-3.    Two Dimensional Contingency Tables
Fitting Models to Two Dimensional Contingency Tables

Lecture Content:

1.  Structure of Two Dimensional Contingency Tables

2.  Log-linear models for two dimensional tables

3.  Independence of the Variables

Main Topics:

1.  Log-linear models

2.  Testing the fit of the model

(There  are  no  transparencies  for  this  lecture) 
Topic 1.    Log-linear models

I.    Basic Issue:    Structure of the data

1.    Variable $A_1$:    I categories

2.    Variable $A_2$:    J categories

Table of observed frequencies $x_{ij}$ (rows i = 1, ..., I; columns
j = 1, ..., J), with row margins $x_{i+}$, column margins $x_{+j}$,
and $x_{++}$ = N, the sample size

3.    Let $p_{ij}$ = P{observation falls in cell (i,j)}

4.    $m_{ij} = N p_{ij}$ = Expected number of observations in (i,j)

5.    $\sum_{i,j} m_{ij} = N$

6.    Let $l_{ij} = \ln m_{ij}$


II.    Method:    Log-Linear model

1.  Model:

$l_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(ij)}$

Saturated model

2.  Note that model is for $\log m_{ij}$, not $\log p_{ij} = \log (m_{ij}/N)$; however,
they differ only by the u-term

3.    Using ANOVA/mean polish analogy, we define:

a.     Overall mean

$u = \dfrac{1}{IJ} \sum_{i} \sum_{j} l_{ij}$
b.    Main effect for variable $A_1$:

$u_{1(i)} = \dfrac{1}{J} \sum_{j} l_{ij} - u$

c.    Main effect for variable $A_2$:

$u_{2(j)} = \dfrac{1}{I} \sum_{i} l_{ij} - u$

d.    Two factor effect (interaction) between variables
(may be zero):

$u_{12(ij)} = l_{ij} - u - u_{1(i)} - u_{2(j)}$

4.    Evaluating degrees of freedom

    U term      df
    u           1
    u_1         I-1           (1 constraint:  $\sum_i u_{1(i)} = 0$)
    u_2         J-1           (1 constraint:  $\sum_j u_{2(j)} = 0$)
    u_12        (I-1)(J-1)    (I-1+J-1+1 constraints)

5.    Other issues

a.  We can also define cross-product ratios and express
the u-terms as functions of them

b.  Can also consider the effect of combining categories;
for example

    A.                    Age
                 10-30   30-50   50+
        Sex  M
             F

    B.  Becomes           Age
                 10-30   30+
        Sex  M
             F
Topic 2.    Testing the Fit of the Model

I.    Basic Issue:    Are $A_1$ and $A_2$ independent?

1.  We fit the model containing only the 2 one-dimensional
margins

$l_{ij} = u + u_{1(i)} + u_{2(j)}$

2.  This implies that $\hat{m}_{ij} = x_{i+} x_{+j} / N$

3.  We call this model 1/2, and compute

$G^2 = 2 \sum_{i} \sum_{j} x_{ij} \log \dfrac{x_{ij}}{\hat{m}_{ij}} \sim \chi^2_{(I-1)(J-1)}$

to test $H_0$:  $A_1$ and $A_2$ are independent

II.    Secondary Issue:    Evaluation of Fit itself

1.    Compute Freeman-Tukey deviates

$z_{ij} = \sqrt{x_{ij}} + \sqrt{x_{ij} + 1} - \sqrt{4 \hat{m}_{ij} + 1}$

2.  $z_{ij} \sim N(0,1)$, approximately

3.  Stem-and-Leaf display of the deviates should be Gaussian
in shape; any $z_{ij}$ greater than 2 in absolute value is
suspect.
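
The steps of this topic (fit model 1/2, compute $G^2$, inspect the Freeman-Tukey deviates) can be sketched as follows. The data are the IQ by season-of-birth counts from homework problem 2 of this unit; the function names are illustrative assumptions, and the deviate formula used is the standard one, $\sqrt{x} + \sqrt{x+1} - \sqrt{4m+1}$.

```python
from math import log, sqrt


def independence_fit(table):
    """Fitted values for model 1/2: m_ij = x_i+ * x_+j / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def g_squared(table, fitted):
    """Likelihood ratio statistic G^2 = 2 * sum x_ij * log(x_ij / m_ij)."""
    return 2 * sum(
        x * log(x / m)
        for row, fit_row in zip(table, fitted)
        for x, m in zip(row, fit_row)
        if x > 0  # empty cells contribute nothing to the sum
    )


def freeman_tukey_deviates(table, fitted):
    """z_ij = sqrt(x) + sqrt(x + 1) - sqrt(4m + 1) for every cell."""
    return [
        [sqrt(x) + sqrt(x + 1) - sqrt(4 * m + 1) for x, m in zip(row, fit_row)]
        for row, fit_row in zip(table, fitted)
    ]


observed = [[29, 19, 12, 18],   # IQ 55-69: summer, autumn, winter, spring
            [13, 17, 20, 20]]   # IQ 40-54

fitted = independence_fit(observed)
g2 = g_squared(observed, fitted)
df = (len(observed) - 1) * (len(observed[0]) - 1)
deviates = freeman_tukey_deviates(observed, fitted)

print(f"G^2 = {g2:.2f} on {df} d.f.")
for row in deviates:
    print([round(z, 2) for z in row])
```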
Lecture 9-4.    Three Dimensional Contingency Tables
Fitting Models to Three Dimensional Contingency Tables

Lecture Content:

1.  Structure of Three Dimensional Contingency Tables

2.  Log-linear Models for Three Dimensional Tables

Main Topics:

1.  Log-linear models

2.  Finding the "best" model

(There are no transparencies for this lecture)
Topic 1.    Log-linear models

I.    Basic Issue:    Structure of the data

1.  Variable $A_1$:    I categories

2.  Variable $A_2$:    J categories

3.  Variable $A_3$:    K categories

    Variable 1 = 1:               variable 3
                          1        2      ...      K
    variable 2    1     x_111    x_112    ...    x_11K
                  2     x_121    x_122    ...    x_12K
                  .       .        .               .
                  J     x_1J1    x_1J2    ...    x_1JK

    Entries are x_1jk

    Variable 1 = 2:  same layout of variable 2 by variable 3

    Entries are x_2jk
II.    Method:    Log-linear model

1.    Saturated Model:

$l_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)}$

2.    Constraints

a.    $\sum_i u_{1(i)} = \sum_j u_{2(j)} = \sum_k u_{3(k)} = 0$

b.    $\sum_i u_{12(ij)} = \sum_i u_{13(ik)} = \sum_j u_{23(jk)} = 0$

c.    $\sum_j u_{12(ij)} = \sum_k u_{13(ik)} = \sum_k u_{23(jk)} = 0$

d.    $\sum_i u_{123(ijk)} = \sum_j u_{123(ijk)} = \sum_k u_{123(ijk)} = 0$

3.  We rarely compute these u-terms.  We merely calculate
$G^2$ to find best fitting model

4.  Evaluating degrees of freedom                                  (2)

    U term      df
    u           1
    u_1         I-1               (1 constraint)
    u_2         J-1               (1 constraint)
    u_3         K-1               (1 constraint)
    u_12        (I-1)(J-1)        (I+J-1 constraints)
    u_13        (I-1)(K-1)        (I+K-1 constraints)
    u_23        (J-1)(K-1)        (J+K-1 constraints)
    u_123       (I-1)(J-1)(K-1)   (IJ+IK+JK-I-J-K+1 constraints)

5.  To find the correct df for a $G^2$ of one of the 8 possible
models, we merely subtract from IJK the degrees of
freedom for every term in the model
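
The subtraction rule in item 5 can be sketched as a small routine. The model notation ("12/3", "12/13", and so on) follows the lecture; the function names and the table dimensions used in the example are assumptions for illustration. The sketch treats every model as hierarchical, so a term such as 12 brings the main effects 1 and 2 with it.

```python
from itertools import combinations
from math import prod


def term_df(term, sizes):
    """df of a u-term: product of (categories - 1) over the variables in it."""
    return prod(sizes[v] - 1 for v in term)


def model_df(model, sizes):
    """Residual df of a hierarchical log-linear model written like '12/3'."""
    terms = set()
    for generator in model.split("/"):
        # A generator implies all lower-order terms among its variables.
        for order in range(1, len(generator) + 1):
            for combo in combinations(generator, order):
                terms.add("".join(combo))
    total_cells = prod(sizes.values())
    # Subtract 1 df for u itself, then the df of every term in the model.
    return total_cells - 1 - sum(term_df(t, sizes) for t in terms)


# Hypothetical table with I = 2, J = 3, K = 4 categories.
sizes = {"1": 2, "2": 3, "3": 4}
for model in ["1/2/3", "12/3", "13/2", "23/1", "12/13", "12/23", "13/23", "12/13/23"]:
    print(f"{model:9s} df = {model_df(model, sizes)}")
```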
Topic 2.    Finding the "Best" Model

I.    Basic Issue:    Descriptions of the Models

1.  There are 8 relevant log-linear models, of 4 different
types

2.  The models, by type, are

a.    Complete Independence, Model 1/2/3:
all interactions are zero

b.    Single association models.
Models 12/3, 13/2, 23/1:  all but one 2 factor
interaction is zero.

i.     12/3:  $\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)}$
ii.     13/2:  $\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(ik)}$
iii.     23/1:  $\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{23(jk)}$

c.    Conditional independence models.
Models 12/13, 12/23, 13/23.
Conditional on the level of the variable included
in the two interactions, the other two variables
are independent

i.     12/13:  $\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)}$

ii.     12/23:  $\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{23(jk)}$

iii.    13/23:  $\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(ik)} + u_{23(jk)}$
d.    No three factor interaction model.
Model 12/13/23

$\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}$

II.    Problem:    How do we compute cell estimates?

1.  All but the 12/13/23 model have "closed form" cell
expected values

2.  The cell estimates for 12/13/23 model must be found by
Iterative Proportional Fitting                                     (3)

3.  Cell estimates, $\hat{m}_{ijk}$, are

a.    1/2/3:    $\hat{m}_{ijk} = x_{i++}\, x_{+j+}\, x_{++k} / N^2$

b.    12/3:    $\hat{m}_{ijk} = x_{ij+}\, x_{++k} / N$

c.    13/2:    $\hat{m}_{ijk} = x_{i+k}\, x_{+j+} / N$

d.    23/1:    $\hat{m}_{ijk} = x_{+jk}\, x_{i++} / N$

e.    12/13:    $\hat{m}_{ijk} = x_{ij+}\, x_{i+k} / x_{i++}$

f.    12/23:    $\hat{m}_{ijk} = x_{ij+}\, x_{+jk} / x_{+j+}$

g.    13/23:    $\hat{m}_{ijk} = x_{i+k}\, x_{+jk} / x_{++k}$

h.    12/13/23:    $\hat{m}_{ijk}$ found by iterative proportional fitting

4.  Main task is to determine which model fits

III.    Solution:    Hypothesis tests for each model

1.    For each of the 8 models, we have a null hypothesis that
the model is an accurate description of the data

a.  1/2/3:

b.  12/3:

c.  13/2:

d.  23/1:

e.  12/13:

f.  12/23:

g.  23/13:
2.    We compute a $G^2$ for each model, and determine whether

$G^2 > \chi^2_{df;\,\alpha}$,  where df is as follows

a.    1/2/3:       $G^2$ has IJK-(I+J+K)+2 df

b.    12/3:        $G^2$ has (IJ-1)(K-1) df

c.    13/2:        $G^2$ has (IK-1)(J-1) df

d.    23/1:        $G^2$ has (JK-1)(I-1) df

e.    12/13:       $G^2$ has (J-1)(K-1)I df

f.    12/23:       $G^2$ has (I-1)(K-1)J df

g.    23/13:       $G^2$ has (I-1)(J-1)K df

h.    12/13/23:    $G^2$ has (I-1)(J-1)(K-1) df

3.  Strive for simplicity:    if 2 models fit, choose the less
saturated of the two

4.  Calculate Freeman-Tukey deviates for the best fitting
model and examine them

5.  Rearrange table to emphasize fit

6.  Examine relevant 2 dimensional margins

(Thoroughly discuss example)                                       (4)-(9)
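
The outline notes that the cell estimates for the 12/13/23 model must be found by iterative proportional fitting. A minimal sketch of that procedure follows, assuming a table stored as nested lists `x[i][j][k]` with no empty two-way margins; the function name, the starting values of 1, the stopping rule, and the example counts are all assumptions for illustration.

```python
def ipf_no_three_factor(x, max_cycles=200, tol=1e-10):
    """Fit model 12/13/23 by cycling through the three two-way margins."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    m = [[[1.0 for _ in range(K)] for _ in range(J)] for _ in range(I)]
    for _ in range(max_cycles):
        previous = [[row[:] for row in layer] for layer in m]
        # Step 1: scale so that the fitted (1,2) margin equals x_ij+.
        for i in range(I):
            for j in range(J):
                scale = sum(x[i][j]) / sum(m[i][j])
                for k in range(K):
                    m[i][j][k] *= scale
        # Step 2: scale so that the fitted (1,3) margin equals x_i+k.
        for i in range(I):
            for k in range(K):
                target = sum(x[i][j][k] for j in range(J))
                current = sum(m[i][j][k] for j in range(J))
                for j in range(J):
                    m[i][j][k] *= target / current
        # Step 3: scale so that the fitted (2,3) margin equals x_+jk.
        for j in range(J):
            for k in range(K):
                target = sum(x[i][j][k] for i in range(I))
                current = sum(m[i][j][k] for i in range(I))
                for i in range(I):
                    m[i][j][k] *= target / current
        change = max(
            abs(m[i][j][k] - previous[i][j][k])
            for i in range(I) for j in range(J) for k in range(K)
        )
        if change < tol:
            break
    return m


# Hypothetical 2x2x2 table of counts.
x = [[[10, 20], [30, 40]], [[15, 5], [25, 35]]]
m = ipf_no_three_factor(x)
for layer in m:
    print([[round(value, 2) for value in row] for row in layer])
```

At convergence the fitted table reproduces all three observed two-way margins, which is the defining property of the 12/13/23 fit.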
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Homework,  Unit  9 
1.  The number of animal bites reported in three successive weeks in
1967 to the Chicago Board of Health were as follows:

    Week                1      2      3

    Number of Bites   268    189    199

a.  Test the null hypothesis that the weeks are identical.

b.  Explain why the null hypothesis in part (a) was or was not rejected.
How could differences or similarities in the three weeks produce
this finding?

2.  We have taken a random sample of 148 retarded children and recorded
their IQ score (dichotomous:    55-69 and 40-54) and season of birth.

                        Birth

    IQ       Summer   Autumn   Winter   Spring

    55-69      29       19       12       18

    40-54      13       17       20       20


In this sample, are IQ and Birth Season independent?  Why or why
not?


3.    Consider the incidence of leukemia among survivors of the atomic
bombings of Hiroshima and Nagasaki.    These cases were recorded
from 1950-1958.

    Dose in rads    % Population Exposed    Cases

    81+                   11.03               34

    21-80                 13.41                5

    0-20                  75.56               12

                                              51

a.  If the number of reported cases were independent of the
amount of radiation, what would be the expected number of
cases for each of the three dosage categories?

b.  Test the null hypothesis that leukemia incidence is
independent of the amount of radiation exposure.
The data set to be analyzed for this problem concerns sex bias in
graduate admissions at Berkeley.  The article was published in Science,
Volume 187, page 398, and is in the Fairley and Mosteller collection.

After careful perusal of the article, you should feel that the authors
have not "done justice" to this table.    What we need is a log-linear
model, and, fortunately, the raw data is given to you on the next 2
pages.

Your assignment is to find the best fitting model for this 2x2x100
table, and to interpret it.    Also determine whether there is a 3 factor
interaction.

I suggest you fit all the possible models... there are 8 of them.
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Berkeley  Graduate  Admissions  Data 
Fall  Quarter  1973 
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Homevbrk  Unit  9 
Solutions 


i.    a)    Animal  Bites  reported  in  3  successive  weeks  in  1967  to  Chicago 
Board  of  Health 


«0  = 


Week 
Bites 

1 
268 

2                    3  Total 
189                 199  656 

Weeks  are  identical 

Hj^:    Weeks  are  not  sittilar 

218.67 
268 
49.33 

218.67  218.67 
189  199 
-29.67  -19.67 

3  (e,~E.)^ 

i=l  H 

(49.33)'^/2l8., 

+ 

(29.67)^/218.67^4-  (19.67)^/218 

=  16.9 

2 
^^2; 

.01  =  ^-21 

Hence  we  reject  H^;  the  weeks  are  hot  idcri'uical- 

b)  We reject the null hypothesis because the probability that the
data result in such an extreme (large) value of X^2 under H0
is (much) less than .01. The differences among the three weeks
could be due to weather (more bites in warm or sunny weather
than cool or rainy weather), the lunar cycle, or, given data
for many other weeks, "Week 1" may just be an outlier, or the
high point of an "animal bite cycle". (It is difficult to make
inferences about the reasonableness of one data value given a
sample of only 3.)
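
The arithmetic in part a) can be checked with a few lines of Python. This is only a sketch: the function name `pearson_x2` is ours, and the critical value 9.21 is copied from the solution rather than computed.

```python
def pearson_x2(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

bites = [268, 189, 199]                              # weeks 1, 2, 3
expected = [sum(bites) / len(bites)] * len(bites)    # equal weeks under H0
x2 = pearson_x2(bites, expected)

CRITICAL_2DF_01 = 9.21   # chi-square, 2 df, upper 1% point (from the solution)
print(round(x2, 1), x2 > CRITICAL_2DF_01)
```

With three categories and no estimated parameters the statistic has 3 - 1 = 2 degrees of freedom, which is why the 2 df critical value is used.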
2.  Season of Birth and IQ scores

H0:  Season of Birth and IQ Scores are independent

Observed values:

IQ         Summer   Autumn   Winter   Spring   total
55-69        29       19       12       18       78
40-54        13       17       20       20       70
total        42       36       32       38      148

Expected Values =

IQ         Summer   Autumn   Winter   Spring
55-69       22.1     19.0     16.9     20.0
40-54       19.9     17.0     15.1     18.0

X^2 = 7.98

\[ \chi^2_{3;\,.05} = 7.82, \qquad \chi^2_{3;\,.01} = 11.34 \]

The observed value of X^2 is just significant at the 5% level, hence
we reject H0 at the 5% level. Note that we accept H0 (Independence)
at the 1% level.
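
As a rough check of the expected values, the sketch below rebuilds them from the margins (row total times column total over the grand total). The 40-54 row is taken here as each column total minus the 55-69 count, which is an assumption based on the printed margins. Because the code does not round the expected counts, its statistic comes out slightly below the 7.98 obtained from the rounded table, but it leads to the same decision.

```python
observed = [
    [29, 19, 12, 18],   # IQ 55-69: Summer, Autumn, Winter, Spring
    [13, 17, 20, 20],   # IQ 40-54 (column totals minus the first row)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print([round(e, 1) for e in expected[0]])
print(round(x2, 2), df)
```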
3.  Incidence of leukemia among survivors of the atomic bombings of
Hiroshima & Nagasaki (1950-1958)

Dose in Rads     % Population Exposed     Cases
81+                    11.03                34
21-80                  13.41                 5
0-20                   75.56                12
                                            51

H0:  Leukemia incidence independent of amount of exposure

Under H0, 11.03% of the reported cases would be in the 81+ rad group,
13.41% of the cases in the 21-80 group, and 75.56% in the 0-20 group.

E:   .1103 x 51 = 5.63      .1341 x 51 = 6.84      .7556 x 51 = 38.54
O:   34                     5                      12

X^2 = 161.73   (huge)

\[ \chi^2_{2;\,.05} = 5.99, \qquad \chi^2_{2;\,.01} = 9.21 \]

Hence we reject H0; the incidence of leukemia is not independent
of exposure.
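
Here the expected counts come from unequal population shares rather than equal cells. A minimal sketch of that calculation follows; the variable names are ours, and since the expected counts are not rounded the result differs from 161.73 in the first decimal.

```python
shares = [0.1103, 0.1341, 0.7556]   # population share: 81+, 21-80, 0-20 rads
cases = [34, 5, 12]                 # reported leukemia cases per dose group

total_cases = sum(cases)
expected = [p * total_cases for p in shares]   # cases expected under H0

x2 = sum((o - e) ** 2 / e for o, e in zip(cases, expected))

print([round(e, 2) for e in expected])
print(round(x2, 1))
```

Almost all of the statistic comes from the 81+ rad group, where 34 cases were observed against fewer than 6 expected.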
4.  The saturated model 123 is

\[ \log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)} \]

where:  i (= 1, 2, ..., 100) refers to department,
        j (= 1, 2) refers to sex,
        k (= 1, 2) refers to admit/deny.

We fit all 8 models

1/2/3
12/3
13/2
23/1
12/13
12/23
13/23
12/13/23

a)  (1/2/3)  complete independence

Under this model there is no association between any pair of
variables, nor among all three together.

Department, Sex, and Admission decision are all independent.

The observed X^2 statistic was 5703

The observed G^2 statistic was 5688

The model has (IJK - I - J - K + 2) = 298 degrees of freedom,

\[ \chi^2_{.05} \approx \tfrac{1}{2}\left(\sqrt{595} + 1.65\right)^2 = 339.11 \]

Hence we reject this model.
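
The critical values quoted for these large degrees of freedom follow the approximation chi^2 ≈ (1/2)(sqrt(2 df - 1) + z)^2 with z = 1.65. The sketch below, with a function name of our own choosing, evaluates it for the degrees of freedom that occur in this problem; for 298 degrees of freedom it gives 339.11.

```python
import math

def chi2_critical_large_df(df, z=1.65):
    """Large-df approximation to the upper-tail chi-square critical value."""
    return 0.5 * (math.sqrt(2 * df - 1) + z) ** 2

for df in (298, 297, 199, 99):
    print(df, round(chi2_critical_large_df(df), 2))
```

Every observed X^2 and G^2 in this problem can be compared against these values, since standard chi-square tables rarely extend beyond 100 degrees of freedom.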
b)  (1,2/3)

Under this model admission is independent of sex and department.
Sex and department however are associated, indicating a tendency for
each sex to be more strongly preferred by some departments at the expense
of other departments.

The observed X^2 statistic was 2262

The observed G^2 statistic was 2420

The model has (IJ-1)(K-1) = 199 degrees of freedom

\[ \chi^2_{.05} \approx \tfrac{1}{2}\left(\sqrt{397} + 1.65\right)^2 = 232.74 \]

Hence we reject this model.


c)  (1,3/2)

Under this model sex is independent of admission and department.
Admission and department are however associated, indicating
a tendency for some departments to be easier to enter than expected,
while other departments are harder to enter than expected.

The observed X^2 statistic was 3118

The observed G^2 statistic was 3428

The model has (IK-1)(J-1) = 199 degrees of freedom

\[ \chi^2_{.05} \approx \tfrac{1}{2}\left(\sqrt{397} + 1.65\right)^2 = 232.74 \]

Hence we reject this model.


d)  (2,3/1)

Under this model department is independent of admission and
sex. Admission and sex are associated, indicating an overall pattern
of sex discrimination. Closer examination will suggest whether this
discrimination favors males or females; the mere presence of the
u_23 term does not indicate which sex is favored.

The observed X^2 statistic was 5407

The observed G^2 statistic was 5585

The model has (JK-1)(I-1) = 297 degrees of freedom

\[ \chi^2_{.05} \approx \tfrac{1}{2}\left(\sqrt{593} + 1.65\right)^2 = 338.04 \]

Hence we reject this model.


e)  (1,2/2,3)  (conditional independence of variables 1 and 3)

Under this model, department and sex, and admission and sex
are associated. Department and admission are conditionally
independent (for a given sex, department and admission are
independent). Hence for males (and also for females),
individuals have equal chances for admission to each department.

The observed X^2 statistic was 2173

The observed G^2 statistic was 2316

The model has (J)(I-1)(K-1) = 198 degrees of freedom

Hence we reject this model.

f)  (1,3/2,3)  (conditional independence of variables 1 and 2)

Under this model department and admission, and admission and
sex are associated. Department and sex are conditionally independent
(for a given admission decision, department and sex are
independent).

The observed X^2 statistic was 3044

The observed G^2 statistic was 3324

The model has (K)(I-1)(J-1) = 198 degrees of freedom

Hence we reject this model.

g)  (1,2/2,3/1,3)  (No 3 factor interaction)

Under this model each pair of variables is associated, but
together independent of the third.

The observed X^2 statistic was 151

The observed G^2 statistic was 155

The model has (I-1)(J-1)(K-1) = 99 degrees of freedom

\[ \chi^2_{.05} \approx \tfrac{1}{2}\left(\sqrt{197} + 1.65\right)^2 = 123.02 \]

Hence we reject this model.
h)  (1,2/1,3)  (conditional independence of variables 2 and 3)

Under this model department and sex, and department and
admission are associated. Admission and sex are conditionally
independent (for a given department, admission and sex are
independent). There is a tendency for some departments to attract
more male (or female) applicants than otherwise expected and a
tendency for some departments to be harder to enter than otherwise
expected. However, the decision in each department to
admit or deny an applicant is independent of the applicant's
sex. This is an extremely important conclusion.

The observed X^2 statistic was 156

The observed G^2 statistic was 159

The model has (I)(J-1)(K-1) = 100 degrees of freedom

Hence we reject this model.

However, this model fits better than the other 7 models, with the
possible exception of 12/13/23. This model is preferred because
it is more parsimonious than 12/13/23.


Conclusion:

Of all the models, the "best fit" was achieved by model
(1,2/1,3), (h), conditional independence of variables 2 and 3. The
fit and the Freeman-Tukey residuals for this model should now be
computed. We rearrange the data into 100 small 2x2 tables. Each has
the structure:

Department i      Admit    Deny
Male
Female

for all departments i = 1, 2, ..., 100.

Within each of these tables, the sex and admit/deny variables are
independent; i.e.

\[ \frac{m_{i11}\, m_{i22}}{m_{i12}\, m_{i21}} = 1 \]

We next give the stem-and-leaf and boxplot of the F-T residuals.
We suppress displaying the 100x2x2 array of the rearranged data and
residuals because of lack of space.

The stem-and-leaf display of the residuals is symmetric, and
indicates that there are very few deviant cells. In fact, we see
only 7 cells > 1.96 in absolute value, a number much smaller than
the .05(400) = 20 expected by chance. Perhaps the variance of the
residuals is smaller than 1.
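
To make the fitting step concrete, here is a small sketch for a single department. The counts are hypothetical (the raw Berkeley table is not reproduced here), and the Freeman-Tukey residual is taken in its usual form sqrt(x) + sqrt(x + 1) - sqrt(4m + 1), which we assume is the version used for the display. Under model (1,2/1,3) the fitted value in each department is simply (row total)(column total)/(department total).

```python
import math

def fit_independence_2x2(table):
    """Fitted counts for one department: row total * column total / total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def freeman_tukey(x, m):
    """Freeman-Tukey residual for observed count x and fitted count m."""
    return math.sqrt(x) + math.sqrt(x + 1) - math.sqrt(4 * m + 1)

# Hypothetical department: rows are male, female; columns are admit, deny.
department = [[30, 20],
              [10, 40]]

fitted = fit_independence_2x2(department)
residuals = [
    [freeman_tukey(x, m) for x, m in zip(obs_row, fit_row)]
    for obs_row, fit_row in zip(department, fitted)
]

for row in residuals:
    print([round(r, 2) for r in row])
```

Repeating this for all 100 departments gives the 400 residuals summarized in the stem-and-leaf display; residuals beyond about 1.96 in absolute value flag cells that the model fits poorly.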


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BLE  '  RES 
»  O.dlOO 


to  I  -"3,5642         -2»7362         -2,6693         -2,05Jl         -  -l,5a76         -1.4176  -1.2B42 

LO  I  -1.2542         -1.2460         -1.1913         -1*1800  ;219  -1.1156 

-10    I  641 
-9    I  0 

-a    I  87444210 
-7    I  754100  

-6    I  88655444332200  .      :  : 

-5  '  I  8733320  -  -  

-4    I  964444333. 

-3    I  99B7266555444442110000:  I 

-2    I  98B65443333333321111111000J 

-1     I  999987766654443333200000000: 

-0    I  87776666655^^433133221111000  

0  I  obodooooooooo  J222222334455555567B888999 

1  I  ddll22333344A4A**44556666777788 

2  I  000011233334555567888888899 

3  I  00000011111112223333444555666666667778899 

4  I  11111112334456677789999 

5  I  001112223355666679 

6  Z  0013445688999 

7  I  233336789 

8  I  0001388 

9  I  0034489 
16    I  03379 
It    I  2247 

12-  I  3  __  _  _  ~   

HI  I  1.2798  1.3622  1.4135  1.4280  1.6604  1.8300  2.0604  2.2110 

HI  I  2.6020 
Quiz, Unit 9

Name ____________________


Write all your answers on these pages. Point totals are given in
parentheses prior to each question. You have sixty (60) minutes for
this quiz. Good luck!


(40)  1.  You are interested in the relationship between family income
and the number of children per family in a small township of
1000 families.

Unfortunately, you do not have available the family income
and number of children for each family. Your research
assistant has been able to gather from each family only
answers to the two questions:

Is your family income (annual) greater than $8000 or
less than $8000?

Does your family have 3 or more children, or 2 or
fewer children?


a.  Explain briefly to your township supervisor what you have
implied by the statement:

"Based on this two dimensional contingency table, the
number of children per family is
independent of the annual family income."
b.  Your research assistant has misplaced the actual cell counts
for the table, but has managed to keep the one-dimensional
margins of the table in his head.

The margins are

                        family income
                     < $8000     > $8000
children   < 3                                550
           >= 3                               450
                       300         700       1000

Amaze your research assistant by computing a two dimensional
table from these one dimensional margins that exhibits no
interaction.


c.  Fortunately, your summer intern has filed away the actual cell
counts. She says that the observed frequencies are:

                        family income
                     < $8000     > $8000
children   < 3         200         350
           >= 3        100         350

Please construct and test a hypothesis for your supervisor for
no interaction in the observed table. Use α = .025. Is there
any interaction present?
(50)  2.  You are interested in studying what patient characteristics
influence the length of stay of hospital patients after
surgical procedures have been performed on them.

You have data on 10000 surgical patients from Massachusetts
General Hospital in 1976.

The patients are placed into cells of a 4 dimensional
contingency table with variables:

Var. 1:  Length of Stay, 4 categories:
         < 1 day,  1 day - 1 week,
         1 week - 1 month,  > 1 month.

Var. 2:  Age, 3 categories:
         < 30 years,  30-50 years,  > 50 years

Var. 3:  Sex, 2 categories:
         Male, Female

Var. 4:  Preoperative status, 4 categories:
         1 = excellent, 2 = good, 3 = fair, 4 = poor

You use the stepwise procedure to fit loglinear models to this
4x3x2x4 table.

The first stage of the fitting process yields the following
results:

Model                   df      G^2
1/2/3/4                 86      375.4
12/13/14/23/24/34       57       41.6
123/124/134/234         18        9.3

a.  What can you conclude from these results with regards to
choosing the "best-fitting" log linear model?
b.  You find the following test statistics for each of the 2
factor interactions:

Interaction     Conditional G^2     df
12                  193.1
13                    4.6
14                  125.3
23                    3.2
24                    5.7
34                    1.9

Fill in the df column, and determine (roughly) which 2
factor interactions are non-zero.

Note:  Conditional G^2 statistics are differences of G^2
statistics for specific models. For example

\[ G^2_{1/2/3} - G^2_{12/3} = G^2_{[12]} \]

c.  Based on these results, write down the "best" loglinear
model for this table in terms of the appropriate U-terms,
and, in the context of this example, interpret it.
(10)  3.  You are interviewing individuals at random in the community to
determine preferences for home energy consumption.

You ask each individual which of the following 4 energy
alternatives he/she prefers:

Natural Gas

Oil

Coal

Solar Power

The 1000 individuals sampled have the following preferences:

# individuals


Natural Gas

Oil

Coal

Solar


1000


Test whether individual preferences are uniformly distributed
among these 4 alternatives.
Solutions


1.  a.  Since this answer is directed to a layman of both exploratory
data analysis and statistics, perhaps the simplest and most
effective response would be to say:

"Independence implies that the variables do not
influence one another in a consistent, measurable
way. It is not possible to predict with accuracy
the number of children in a family based upon their
income."

A more technical response which would assume a background in
the subject would be to say that:

"The contribution of one category of a factor does
not help define the contribution of any category of
the other factor. In other words, the probability
of a given observation falling in a particular cell of
the table is equal to the product of the marginal
probabilities."

The form of the table of raw data can be described
by the form:

children 

>2  <3 


<  8,000 


Income


>  8,000 


a 

b 

« y 

ax 

by 

where x and y represent
values (predicted or
observed) of the table
and a and b are positive
constants


b.  The model of independence in the table is given by

\[ \log m_{ij} = U + U_{1(i)} + U_{2(j)} \]

where m_ij is a predicted cell value,
U is the grand mean,
U_1 is the additional contribution to the grand mean
associated with the first variable, and
U_2 is the additional contribution of the second variable.

m_ij is also given by the equation

\[ m_{ij} = \frac{x_{i+}\, x_{+j}}{x_{++}} \]

where each x_{i+} is a row sum,
each x_{+j} is a column sum, and
x_{++} is the sum of all cell values (all x_ij's).

The solution (given the marginal sums) is:

                                  children
                    < 3                         >= 3
< $8,000    (550)(300)/1,000 = 165     (450)(300)/1,000 = 135       300
> $8,000    (550)(700)/1,000 = 385     (450)(700)/1,000 = 315       700
                    550                         450               1,000


c.  The "null" hypothesis to be tested is that there is no interaction
between family income and the number of children (i.e.
U_12 = 0).

To test this hypothesis we use the X^2 formula:

\[ X^2 = \sum \frac{(\text{Observed value} - \text{Expected value})^2}{\text{Expected value}} \]

We calculated the expected values in b above under the
assumption of independence (i.e. U_12 = 0). The observed
values are given in this question.

The respective (Obs - Exp)^2 / Exp values for the table are

7.42
3.18
9.07
3.89

\[ X^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}} = 23.58 \]

Using α = .05 for a one-tail test, we find z = 1.65.

Since √23.58 = 4.86 > 1.65 the null hypothesis is rejected.

Our conclusion is that an interaction between the variables
probably does exist and that our assumption of independence
was probably incorrect.
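
Parts b and c can be reproduced together in a short sketch. The layout below (rows for number of children, columns for income) and the variable names are our own choices. The code keeps full precision, so its statistic is about 23.57 rather than the 23.58 obtained by summing rounded terms.

```python
import math

children_margin = [550, 450]   # < 3 children, >= 3 children
income_margin = [300, 700]     # < $8000, > $8000
n = sum(children_margin)

# Part b: the table with no interaction has m_ij = x_i+ * x_+j / x_++
expected = [[r * c / n for c in income_margin] for r in children_margin]

# Part c: observed counts reported by the summer intern
observed = [[200, 350],
            [100, 350]]

terms = [
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
]
x2 = sum(terms)

print(expected)
print([round(t, 2) for t in terms], round(x2, 2), round(math.sqrt(x2), 2))
```

For a 2x2 table X^2 has 1 degree of freedom, so its square root can be compared with a standard normal value, which is the comparison made in the solution.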


2.  a.  By comparing the number of degrees of freedom (d.f.) with the G^2
value for each model we can discover which model fits best.

The model of independence (1/2/3/4) is apparently much
too simple; the G^2 is almost four times the number of
degrees of freedom.

The model of all two factor interactions displays a G^2
to d.f. ratio of .73 which is closest to 1 of any of
the three models. However, as shown by the ratio of
less than one, the model is "overfit". A simpler model
might be preferred.

The model of three factor interactions is much too complex
with a G^2/d.f. ratio of about .52.

The 12/13/14/23/24/34 is therefore the "best-fitting"
model of the three given. Partitioning of this model
would be advised to get a simpler model which is not
overfit.

b.  To calculate the degrees of freedom for each interaction we
must consider the number of categories for each variable.

Variable 1 has 4 categories.  Let I = 4
         2  "  3     "         "  J = 3
         3  "  2     "         "  K = 2
         4  "  4     "         "  L = 4

Since the question asks for the number of degrees of freedom
for each interaction, the calculations are straightforward
multiplications of the degrees of freedom of the involved
variables. Therefore the result is:

Interaction      G^2        d.f.
12              193.1       (I-1)(J-1) = 6
13                4.6       (I-1)(K-1) = 3
14              125.3       (I-1)(L-1) = 9
23                3.2       (J-1)(K-1) = 2
24                5.7       (J-1)(L-1) = 6
34                1.9       (K-1)(L-1) = 3

The question also asks what two factor interactions are non-zero.
The sheer magnitude of the G^2 makes it apparent that 12 and 14 are
non-zero. They are so overwhelming that inclusion of the others
would probably contribute little to the goodness of fit of the
model.
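
The same screening can be done mechanically by comparing each conditional G^2 with a chi-square critical value for its degrees of freedom. The sketch below hard-codes the standard upper 5% points for 2, 3, 6 and 9 degrees of freedom; the choice of the 5% level is our assumption, since the solution only judges the magnitudes by eye.

```python
levels = {1: 4, 2: 3, 3: 2, 4: 4}   # categories per variable (I, J, K, L)

conditional_g2 = {
    (1, 2): 193.1, (1, 3): 4.6, (1, 4): 125.3,
    (2, 3): 3.2,   (2, 4): 5.7, (3, 4): 1.9,
}

# Upper 5% points of the chi-square distribution, keyed by df
CHI2_05 = {2: 5.99, 3: 7.81, 6: 12.59, 9: 16.92}

nonzero = []
for (a, b), g2 in conditional_g2.items():
    df = (levels[a] - 1) * (levels[b] - 1)
    if g2 > CHI2_05[df]:
        nonzero.append((a, b))
    print(f"{a}{b}: G2 = {g2:6.1f}, df = {df}, critical = {CHI2_05[df]}")

print(nonzero)
```

As a consistency check, the six conditional statistics add up to 333.8, which is the difference between the G^2 of the independence model (375.4) and that of the all-two-factor model (41.6).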

c.  The best log-linear model is given by:

\[ \log m_{ijkl} = U + U_1 + U_2 + U_3 + U_4 + U_{12} + U_{14} \]

Length of stay, age, sex, and preoperative status each
contribute in describing particular types of patients and the
frequency with which they can be expected to be observed. In
addition, there is a relationship (an interaction) between
length of stay and age, and between length of stay and
preoperative status. Although we do not know for sure what
these relationships are, we can postulate that perhaps the very
young or very old normally require extra care in their treatment
and therefore generally stay longer. Similarly, it can be
reasoned that the worse the preoperative status of the patient
the longer he or she will have to stay in the hospital.

There are no other significant two factor interactions, nor
are there any three factor relationships which help much in
describing the patient population.

3.  If individual preferences of energy alternatives are uniformly
distributed, each category would have the same number of
observations (neglecting sampling error). This means that the expected
value for each of the four alternatives is 250 (i.e. 1000/4). The
null hypothesis is M1 = M2 = M3 = M4.

Using the X^2 formula

\[ X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]

we see X^2 = 20.

With 3 degrees of freedom, at a 95% level of significance, a table
value of 7.815 is found. Since 20 > 7.815, the null hypothesis
is rejected. We cannot assume that the preferences are uniformly
distributed.
Final Examination
Second Term


Name: ____________________

All answers should be written on this test. Total point score
is 100. You need not answer every question. Please read the
examination before you begin writing. You have 2.5 hours to complete this
examination.

Part I.  This section is worth 40 points. Answer all 3 questions (#1, 2 and 3)


(1)  You are interested in constructing a linear model relating the
number of pharmacists per census tract to other policy relevant
features of the tract.

a.  ... use least squares techniques to estimate your
model. Consequently, you suspect that two problems may arise
because of these correlated variables.

(a)  The residuals will not be independent.

(b)  Their coefficient estimates will not be reliable.

(c)  The F-test will produce a ... result.

(d)  Both coefficient estimates ... large t-statistics.

(e)  The computer may have problems with (X'X)^{-1}.

(f)  The coefficient of determination will be indeterminate.

b.  When you tell your supervisor about your intentions to use
least squares to estimate the model she asks what this means.
Your reply is that it uses one specific minimization criterion
which is: minimize


-  V 

(b) 

e|y^ 

Cc) 

-  Y^)^ 

(d) 

m\ 

(e) 

None 

of  the 
c.  You continue your explanation by saying that, "If the
assumptions underlying least squares hold, then this procedure yields
optimal estimates of the coefficients". What are the
assumptions?


d.  "But in what sense are least squares regression lines optimal?"
she asks. You reply...


e.  "Ok", your colleague says, "so they are optimal when the
assumptions hold. But suppose for our data the assumptions
do not hold. What does this imply with regards to the
distributions of the standard set of test statistics that we
always compute?"
Your supervisor states that 5% of the census tracts in
Pittsburgh have median family size greater than 6 individuals/
family. In disbelief, you gather data on the 86 census tracts
and find that median family size per tract is remarkably well
behaved, with  y  -  &.5  and       «  -20.  Is your supervisor
correct? Why or why not?


The computer center at Robber Baron University claims a 95%
availability for their HAL-25Q computer. You are somewhat
skeptical of this statement, so you gather data for the 30
days that you used the system for your latest paper. You
calculate the average availability to be 85% with associated
standard deviation s of 5%.

(i)  Construct a 95% confidence interval for the true percentage.

(ii)  Based on this interval, state and test a hypothesis (α = .05)
to determine the truth of the computer center's assertion.

(iii)  Are the distributional assumptions that you made to
test the hypothesis in (ii) appropriate? Why or why
not?
3.  You are interested in studying what patient characteristics
influence the length of stay of hospital patients after
surgical procedures have been performed on them.

You have data on 10000 surgical patients from Massachusetts
General Hospital in 1976.

The patients are placed into cells of a 4 dimensional
contingency table with variables:

Var. 1:  Length of Stay, 4 categories:
         < 1 day,  1 day - 1 week,
         1 week - 1 month,  > 1 month.

Var. 2:  Age, 3 categories:
         < 30 years,  30-50 years,  > 50 years

Var. 3:  Sex, 2 categories:
         Male, Female

Var. 4:  Preoperative status, 4 categories:
         1 = excellent, 2 = good, 3 = fair, 4 = poor

You use the stepwise procedure to fit loglinear models to this
4x3x2x4 table.

The first stage of the fitting process yields the following
results:

Model                   df      G^2
1/2/3/4                 86      375.4
12/13/14/23/24/34       57       41.6
123/124/134/234         18        9.3

a.  What can you conclude from these results with regards to
choosing the "best-fitting" log linear model?

b.  You find the following test statistics for each of the 2
factor interactions:

Interaction     Conditional G^2     df
12                  193.1
13                    4.6
14                  125.3
23                    3.2
24                    5.7
34                    1.9

Fill in the df column, and determine (roughly) which 2
factor interactions are non-zero.

Note:  Conditional G^2 statistics are differences of G^2
statistics for specific models. For example

\[ G^2_{1/2/3} - G^2_{12/3} = G^2_{[12]} \]

c.  Based on these results, write down the "best" loglinear
model for this table in terms of the appropriate U-terms,
and, in the context of this example, interpret it.
Part II.  This section is worth 40 points. Answer 5 of the 7 questions.


1.  If someone asked you to describe how exploratory and confirmatory
techniques differed and what they were used for, how would you
respond?


2.  What feature distinguishes a table on which you would perform a
two-way analysis from a table to which you would fit a log-linear model?

3.  What does an extended fit incorporate that a simple two-way analysis
does not?

Choose one:

(a)  additive differences
(b)  multiplicative effects
(c)  column medians of zero
(d)  a U_12(ij) interaction term

4.  Suppose that a sample of voters in a certain district were selected
by choosing every hundredth person from the list of registered voters
and including that person and his/her spouse in the sample. Would
this be a random sample? Why or why not?


5.  After performing 4 half-steps of median polish, your two-way
table of bordered and ordered residuals shows large positive
values in the upper left and lower right corners and large
negative values in the upper right and lower left corners. You should:

Choose one:

(a)  Go to an extended fit

(b)  Perform more half-steps of median polish

(c)  Return to the original data and perform mean polish

(d)  Perform an X^2 test with the observed and expected frequencies

6.  What does a 95% confidence interval about p mean?


7.  What is the definition of the expectation of a continuous random
variable if f(x) is its probability density function?

The following questions are worth 20 points. All refer to chapters
in the book by Fairley and Mosteller. There are 10 questions, each with
five possible answers. Circle the one best answer. If you haven't done
the readings and you guess at the answers, how many points would you gain
on average?


The following questions refer to Fairley's paper, "Accidents on Route 2".


1.  In his initial exploration of the data on accidents Fairley used a
stem-and-leaf display and concluded that:

a.  quarterly totals of accidents could not be predicted accurately
by simply using an average value.

b.  the count of accidents by quarter could be transformed by
taking its log and then predictions would be straightforward

c.  the data were remarkably symmetric

d.  missing values precluded any classical analysis

e.  a regression using least squares estimation would yield
unacceptably low t-statistics for the time variables


2.  When exploring year and quarter effects simultaneously Fairley
tried an additive model and the following procedure to fit it:

a.  Log-linear contingency table analysis

b.  Linear regression

c.  Median polish

d.  Extended fit

e.  None of the above
3.  He also tried a multiplicative model. This involved

a.  testing for interactions

b.  multiplying marginals

c.  adding an extended fit

d.  multiplying by the conditional typical

e.  none of the above

The following questions refer to the chapter "A Statistical Search
for Unusually Effective Schools" by Klitgaard and Hall.

4.  They used regression primarily as

a.  a confirmatory procedure

b.  an exploratory procedure

c.  an inferential procedure

d.  an experimental procedure

e.  an effective procedure

5.  The policy implication that they derived from their study was:

a.  We need to build more effective schools

b.  Unusually effective schools cannot be produced

c.  Studies of educational effectiveness should focus on classrooms
and programs

d.  Studies of educational effectiveness are doomed to failure
because of collinearity problems

e.  Rural schools are more effective than urban schools.
The following question refers to the chapter by Shepard on "The Wait
to See the Doctor".


6.  Using a two-way analysis he concluded that:

a.  Doctor workload was significant at the 5% level

b.  Late startup of the clinic could not be represented in a
linear model

c.  Late startup was a more important factor than doctor workload

d.  Doctor workload was a more important factor than late startup

e.  Race of doctor interacted with race of patient


The following questions refer to the chapter by Lave and Seskin,
"Does Air Pollution Shorten Lives?"

7.  This chapter

a.  proves that air pollution shortens life

b.  shows that nothing can be proven using regression

c.  proves that exploratory data analytic procedures are superior
to confirmatory procedures

d.  could be improved by an extended study of the sensitivity of
the results to the assumptions of least squares

e.  could be improved by an extended study of the sensitivity
of the elasticities to the transformations performed

8.  Another implication they draw is:

a.  in modern America, reducing air pollution is the only way to
lengthen life expectancy

b.  the elasticity of poverty indicates that a reduction in poverty
will result from a reduction in air pollution

c.  regression procedures should not be used to estimate models for large
SMSAs.

d.  the most useful decision variable is the minimum level of a
pollutant

e.  the most useful decision variable is the maximum level of a
pollutant
The next questions refer to the chapter by Gilbert, Light and Mosteller,
"Assessing  Social  Innovations". 

9.    They  distinguish  between  the  following  types  of  field  trials: 

a.  Expensive  and  Inexpensive 

b.  Purposeful  and  Integrative 

c.  Continuous  and  discrete 

d.  Survey  based  and  experimental 

e.  Randomized  and  nonrandomized 

10. This chapter described the application of which procedure in a
policy context:

a.  Exploratory  analysis 

b.  Hypothesis  testing 

c.  Experimental  meteorology 

d.  Normal  deviates 

e.  Ridge  regression 
Final Examination, Second Term
Solutions


Part I.

1) a. Parts b. and e.

b.  part 

c.  Assumptions  are 

i. The model is correct, i.e., y is a linear function of the
x's.

ii. Residuals are independent

iii. Residuals are homoscedastic

iv. Residuals are ~ Gaussian (0, σ²)

d. Of all linear unbiased estimates, the least squares regression
line yields residuals with minimum variance. The line is
"optimal" in this sense only if the four assumptions are true.

If the model is not correct, the regression coefficients do
not estimate the true population values. The coefficient
estimates will be biased, although still normally distributed.


If the residuals are not independent, then one must consider
the covariances of y_i and y_j when calculating sample
distributions. The sums of squares will be mixtures of χ²,
but the degrees of freedom are indeterminate. This fact
influences the distribution of t statistics, R², and the
F statistic.

If the errors are heteroscedastic, then the residuals are
not identically distributed. The sums of squares will be
mixtures of χ² with varying degrees of freedom. The regression
coefficient estimates will be linear combinations of Gaussian
random variables. We will not know the linear combinations
or mixtures unless we know the variance structure of the
errors.

Invalidation of the assumption of Gaussianity is the most
severe. None of the null hypothesized distributions will
obtain; moreover, it may be quite difficult to compute the
true distributions.
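
As a minimal illustration of the first case above (an incorrect model), the following Python sketch fits a least squares line to data that are in fact quadratic. The data values and the function name fit_line are made up for illustration and are not taken from the examination. The residuals fall into a systematic U-shaped pattern, which is a common visible symptom that y is not a linear function of x.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: the true relationship is y = x**2, not a line.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("intercept:", intercept, "slope:", slope)
print("residuals:", residuals)  # positive at the ends, negative in the middle
```

The residuals still sum to zero, as they must for any least squares line with an intercept, so the curvature shows up only in their pattern against x, not in their average.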
Module IV

2) a. Median family size ~ Gau (4.5, .20)

Pr {median family size > 6} =

Pr {(median family size - 4.5)/√.20 > (6 - 4.5)/√.20} =

Pr {z > 3.33} < .001

Our supervisor, who claims that

Pr {median family size > 6} = .05, is incorrect.
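
The tail probability in part a can be checked numerically. The sketch below assumes, as the standardization in the solution suggests, that .20 is the variance of the Gaussian approximation (so the standard deviation is √.20); the function name upper_tail is hypothetical. With the unrounded square root, z comes out slightly above the 3.33 quoted in the solution, which presumably rounds √.20 to .45; the conclusion is the same either way.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail(threshold, mean, variance):
    """Return (z, Pr{X > threshold}) for X ~ Gaussian(mean, variance)."""
    z = (threshold - mean) / sqrt(variance)
    return z, 1.0 - NormalDist().cdf(z)


# Assumption: Gau(4.5, .20) gives the mean and the variance.
z, tail = upper_tail(6.0, 4.5, 0.20)
print(f"z = {z:.2f}, Pr{{median family size > 6}} = {tail:.5f}")
```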

b. i. With a large number of observations, a 95% confidence interval
is

p̂ ± z (standard error of p̂)

.85 ± 1.96 (.05) =
(.752, .948)

ii. H0: P = .95
H1: P ≠ .95

Since the confidence interval that we constructed in part
i, (.752, .948), does not contain .95, we reject H0. We
do not agree with the computer center.

iii. We have relied on the assumption that our data are
approximately Gaussian. However, the true distribution
is quite skewed with such a large P. In light of this
skewness, a sample size of 36 is not large enough to
justify our assumption.
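
The arithmetic in part b can be reproduced with a short Python sketch. The estimate .85, the standard error .05, the hypothesized value .95, and the sample size 36 are taken from the solution; the function name normal_interval and the variable names are hypothetical. The last lines look at the expected number of "failures" under P = .95, which is one rough way to see why the Gaussian approximation is doubtful here.

```python
from statistics import NormalDist


def normal_interval(estimate, std_error, level=0.95):
    """Large-sample interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return estimate - z * std_error, estimate + z * std_error


low, high = normal_interval(0.85, 0.05)
claimed_p = 0.95
reject = not (low <= claimed_p <= high)
print(f"interval = ({low:.3f}, {high:.3f}), reject H0: {reject}")

# Part iii: with n = 36 and P = .95 only about 1.8 failures are expected,
# well below the common rule of thumb of at least 5.
n = 36
expected_failures = n * (1 - claimed_p)
print(f"expected failures under H0: {expected_failures:.1f}")
```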
3) a. The "best-fitting" log-linear model will have several two factor
interactions, but should not have any three factor interactions.

12  6  nonzero

13  3 

14  9  nonzero 

23  2 

24  6 
34  3 

c.  Model: 

log m_ijkl = u + u_1(i) + u_2(j) + u_3(k) + u_4(l) + u_12(ij) + u_14(il)

Conditional on a patient's length of stay, his or her age and
preoperative status are independent.
Part II.


1. Confirmatory techniques make distributional assumptions about the
data. Based on these assumptions, inferences are made concerning
the probabilities of various outcomes, and most likely parameter
values. Of course, if the distributions are not accurate, then
the inferences are invalid.

Exploratory techniques do not make a priori assumptions about the
data. Instead, they examine the relationships among the data and
attempt to "describe" the data based on these relationships.

Thus, even if distributional assumptions are not true, exploratory
techniques may still be able to describe, summarize, and fit models
to data.


2. The feature that distinguishes these tables is the nature of the
cell entries. A two-way table has cell entries that are values of
a third variable, a response variable. We in fact use the row and
column variables to "explain" this response variable. In a
contingency table, cell (i,j) is merely a count of the number of occurrences
of category i of variable A_i and category j of variable A_j.


3. Part b


4. This procedure does not yield a representative sample. The
point in the list must be randomly chosen. A better procedure would
be to use a table of random numbers to choose all individuals. Of
course, spouses are not selected randomly; a spouse has a probability
of unity of being in the sample if his/her spouse is included.


5.    Part  a. 


6. A 95% confidence interval about p implies that if we obtained N
samples, estimated p in each sample, and constructed a 95%
confidence interval about each estimate, 95% of the intervals will
contain the true p.
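
The repeated-sampling interpretation in answer 6 can be illustrated by simulation. The Python sketch below is only an illustration under stated assumptions: the true proportion, the sample size, the number of samples, and the seed are arbitrary choices, the function name coverage is hypothetical, and the interval used is the usual large-sample one (estimate ± z × standard error). The fraction of intervals that contain the true p should come out close to, though not exactly, .95.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(true_p=0.5, n=400, n_samples=2000, level=0.95, seed=1):
    """Fraction of large-sample intervals that contain true_p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = 0
    for _ in range(n_samples):
        # Draw one sample of size n and estimate p from it.
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= true_p <= p_hat + half_width:
            hits += 1
    return hits / n_samples


observed = coverage()
print(f"fraction of intervals containing the true p: {observed:.3f}")
```

Any single interval either contains the true p or does not; the 95% describes the long-run behavior of the procedure over many samples.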


7. ∫ x f(x) dx
Part III.

1. Part a

2. Part e

3. Part b

4. Part b

5.  Part  c  or  e 

6.  Part  c 

7.  Part  d 

8.  Part  d 

9.  Part  e 
10.  Part  b 
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